ABSTRACT

The  partial  correlation  (PARCOR)  coefficients  of  the 
least  squares  lattice  filter  may  be  used  to  conveniently  and 
efficiently  represent  various  types  of  acoustic  signals. 
Because  a  stationary  time  series  may  be  represented  by  a 
small  number  of  PARCOR  coefficients,  the  PARCOR  coefficients 
have  been  widely  used  as  effective  pattern  recognition 
parameters  for  the  representation  and  transmission  of 
information.  This  thesis  establishes  the  PARCOR 
coefficients  of  the  least  squares  lattice  filter  as 
efficient  and  effective  pattern  recognition  features  for  the 
classification  and  identification  of  synthesized  steady 
state  vowel-like  sounds.  The  PARCOR  coefficient  technique 
is  shown  to  be  a  much  quicker  and  more  computationally 
efficient  method  of  vowel  identification  than  identification 
by  formant  frequencies,  which  involves  the  computation  of 
poles  and  zeros  and  the  back-calculation  of  formant 
frequencies  and  formant  bandwidths. 

It  is  well  documented  in  the  literature  that  steady 
state  vowel  sounds  may  be  identified  and  classified 
according  to  their  formant  frequencies.  The  formant 
frequencies  of  each  vowel  may  be  shown  to  cluster  together, 
and  the  vowel  clusters  may  be  separated  from  one  another  in 
the  space  defined  by  some  subset  of  the  formant  frequencies 
(Barney, 1952; Peterson, 1952; Peterson & Barney, 1952;
Potter & Steinberg, 1950).


When  a  spoken  vowel  is  presented  in  time  series  form, 
utilization  of  these  clustering  properties  requires  a 
transformation  from  time  domain  to  frequency  domain  (formant 
frequencies)  which  is  quite  complicated  and  computationally 
expensive. A more efficient vehicle for classifying steady
state vowel-like sounds, developed by this author, is the set of
forward PARCOR coefficients K_i, i = 1, 2, ..., p, which
arise  naturally  as  intermediate  parameters  in  a  pth-order 
least  squares  complex  adaptive  lattice  filter  (Hodgkiss  & 
Presley,  1981,  1982). 


The  formant  frequency  data  used  in  this  study  are  those 
measured by Peterson and Barney (1952; Barney, 1952;
Potter & Steinberg, 1950), obtained through the courtesy of
the Bell Laboratories Archives. A time series for each
vowel utterance is generated from three formant frequencies
using a six-pole IIR digital recursive filter. The time
series  are  then  inverse  filtered  via  a  six-zero  complex 
adaptive  lattice  filter  (Alexandrou  &  Hodgkiss,  Note  1; 
Hodgkiss  &  Presley,  1981,  1982),  producing,  for  each 
utterance,  a  set  of  six  PARCOR  coefficients. 


The  PARCOR  coefficients  produced  by  the  lattice  exhibit 
the  same  clustering  properties  as  do  the  formant 
frequencies; namely, minimum cluster size (average
intracluster distance) in two dimensions for all vowels, and
maximum cluster separability (intercluster distance) in six
dimensions for selected adjacent-vowel pairs. As a combined
measure of compactness and separability, the ratio of the
sum of average intracluster distances to intercluster
distance for each of the adjacent-vowel pairs yielded
roughly  equivalent  results  for  the  formant  frequencies  and 
PARCOR  coefficients.  Graphically,  the  first  two  PARCOR 
coefficients  are  sufficient  for  the  identification  of  the 
first  nine  vowels,  whereas  the  third  PARCOR  coefficient  is 
necessary for identification of the tenth vowel, /ɝ/. These
results  are  analogous  to  those  observed  for  the  clustering 
of formant frequencies.

CHAPTER I


INTRODUCTION 

The  partial  correlation  (PARCOR)  coefficients  of  the 
least  squares  lattice  may  be  used  to  conveniently  and 
efficiently  represent  various  types  of  acoustic  signals. 
Because  a  stationary  time  series  may  be  represented  by  a 
small  number  of  PARCOR  coefficients,  the  PARCOR  coefficients 
have  been  widely  used  as  effective  pattern  recognition 
parameters  for  the  representation  and  transmission  of 
information.  This  thesis  establishes  the  PARCOR 
coefficients  of  the  least  squares  lattice  as  efficient  and 
effective  pattern  recognition  features  for  the 
classification  and  identification  of  synthesized  steady 
state  vowel-like  sounds.  The  PARCOR  coefficient  technique 
is  shown  to  be  a  much  quicker  and  more  computationally 
efficient  method  of  vowel  identification  than  identification 
by  formant  frequencies,  which  involves  the  computation  of 
poles  and  zeros  and  the  back-calculation  of  formant 
frequencies  and  formant  bandwidths.  Even  though  this  thesis 
addresses  identification  of  vowels,  it  is  clear  that  the 
PARCOR  coefficients  may  be  used  to  identify  characteristics 
of  other  acoustic  and  electromagnetic  signals. 

It  is  well  documented  in  the  literature  that  steady 
state vowel sounds may be identified and classified
according to their formant frequencies. The formant
frequencies  of  each  vowel  may  be  shown  to  cluster  together, 
and  the  vowel  clusters  may  be  separated  from  one  another  in 
the  space  defined  by  some  subset  of  the  formant  frequencies 
(Barney, 1952; Peterson, 1952; Peterson & Barney, 1952;
Potter & Steinberg, 1950).

When  a  spoken  vowel  is  presented  in  time  series  form, 
utilization  of  these  clustering  properties  requires  a 
transformation  from  time  domain  to  frequency  domain  (formant 
frequencies)  which  is  quite  complicated  and  computationally 
expensive. A more efficient vehicle for classifying steady
state vowel-like sounds, developed by this author, is the set of
forward PARCOR coefficients K_i, i = 1, 2, ..., p, which
arise naturally as intermediate parameters in a pth-order
least  squares  complex  adaptive  lattice  filter  (Hodgkiss  & 
Presley,  1981,  1982). 

The  specific  purpose  of  this  research  is  to  establish 
the  value  of  the  PARCOR  coefficients  as  efficient  and 
effective  pattern  recognition  features  for  the 
classification  and  identification  of  (synthesized)  steady 
state  vowel-like  sounds.  Specifically,  the  intent  of  this 
thesis  is  to  show  that  for  steady  state  vowel-like 
utterances,  the  PARCOR  coefficients  of  each  vowel  will 
cluster  together,  and  that  the  ten  English  vowels  may  be 
separated from one another in the space defined by some
subset of the PARCOR coefficients in much the same way as
the  formant  frequencies  cluster  and  separate  the  vowels.  In 
other  words,  the  inverse  filtering  procedure  may  be 
considered  as  a  change  of  variable  (Turner,  1982);  the 
PARCOR  coefficients  behave  in  this  manner  when  the  related 
formant  frequencies  themselves  exhibit  a  clustering 
behavior.

A  time  series  for  each  vowel  utterance  is  generated 
from  three  formant  frequencies  using  a  six-pole  IIR 
recursive  digital  filter.  The  formant  frequency  data  are 
those measured by Peterson and Barney (1952; Barney, 1952;
Potter  &  Steinberg,  1950),  obtained  through  the  courtesy  of 
the  Bell  Laboratories  Archives.  A  six-zero  complex  adaptive 
least  squares  lattice  filter  is  used  as  an  inverse  filter  on 
each  time  series,  producing,  for  each,  a  set  of  six  PARCOR 
coefficients.
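
As an illustration of the synthesis step, the following Python sketch builds the coefficients of a six-pole filter from three formant frequencies (one conjugate pole pair per formant) and runs the recursion z(n) = sum_j a_j z(n-j) + v(n). The sampling rate, the formant bandwidths, the illustrative formant values, the impulse excitation, and all function names are assumptions made for this example only; the procedure actually used in this study is the one described in Chapter III.

```python
import math

def formant_filter_coefficients(formants_hz, bandwidths_hz, fs_hz):
    """Return [a_1, ..., a_p] for A(z) = 1 - sum_j a_j z^-j.

    Each formant contributes one conjugate pole pair, so three formants
    give a six-pole filter. Bandwidths and fs_hz are assumed inputs.
    """
    poly = [1.0]  # coefficients of A(z) in increasing powers of z^-1
    for f, bw in zip(formants_hz, bandwidths_hz):
        r = math.exp(-math.pi * bw / fs_hz)   # pole radius from bandwidth
        theta = 2.0 * math.pi * f / fs_hz     # pole angle from frequency
        section = [1.0, -2.0 * r * math.cos(theta), r * r]
        out = [0.0] * (len(poly) + 2)
        for i, p_coef in enumerate(poly):      # polynomial multiplication
            for j, s_coef in enumerate(section):
                out[i + j] += p_coef * s_coef
        poly = out
    return [-c for c in poly[1:]]

def synthesize(a, excitation):
    """All-pole recursion z(n) = sum_j a_j z(n-j) + v(n), zero initial state."""
    z = []
    for n, v in enumerate(excitation):
        acc = v
        for j, aj in enumerate(a, start=1):
            if n - j >= 0:
                acc += aj * z[n - j]
        z.append(acc)
    return z

# Illustrative (assumed) values: three formants, bandwidths, 10 kHz sampling.
a = formant_filter_coefficients([730.0, 1090.0, 2440.0],
                                [60.0, 90.0, 120.0], 10000.0)
z = synthesize(a, [1.0] + [0.0] * 99)  # impulse response, 100 samples
```

Narrower assumed bandwidths move the poles toward the unit circle and lengthen the decay of the synthesized series.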

Considerable  motivation  exists  for  the  development  of  a 
system  identification  technique  which  does  not  require  the 
calculation  of  formant  frequencies  and  bandwidths  from  a 
time  series.  Specifically,  in  the  field  of  speech 
processing (Markel, 1972, 1973; McCandless, 1974; Wakita &
Kasuya, 1977), researchers have commonly calculated the
coefficients, a_j, of the denominator polynomial from the
transfer function of an inverse filter and then obtained the
formant frequencies from the roots of the polynomial. As
stated previously, this is a complicated calculation. For
applications where the formant frequencies themselves are not
of interest but are computed only as clustering variables for
vowel identification, a change of clustering variable to the
PARCOR coefficients would eliminate this expensive and
complicated computation.
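
To make the cost of the root-based approach concrete, the sketch below recovers a formant frequency and bandwidth from a single second-order section 1 - a_1 z^-1 - a_2 z^-2, where the roots have a closed form. For a sixth-order polynomial, a numerical root finder would be needed before this conversion could be applied to each conjugate pair. The function name and sampling rate are assumptions for illustration.

```python
import cmath
import math

def formant_from_section(a1, a2, fs_hz):
    """Frequency and bandwidth (Hz) of the pole pair of 1 - a1 z^-1 - a2 z^-2.

    The poles solve z^2 - a1 z - a2 = 0. For a complex pair the root with
    positive imaginary part is used.
    """
    disc = cmath.sqrt(a1 * a1 + 4.0 * a2)
    pole = (a1 + disc) / 2.0
    freq = fs_hz * abs(cmath.phase(pole)) / (2.0 * math.pi)
    bandwidth = -fs_hz * math.log(abs(pole)) / math.pi
    return freq, bandwidth

# Hypothetical section built from a 500 Hz formant with 60 Hz bandwidth.
fs = 10000.0
r = math.exp(-math.pi * 60.0 / fs)
theta = 2.0 * math.pi * 500.0 / fs
freq, bw = formant_from_section(2.0 * r * math.cos(theta), -r * r, fs)
```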

Vowel  Identification  by  Formant  Frequency 

During  voiced  speech,  when  the  vocal  tract  is  excited 
by  the  glottal  source,  the  spectral  peaks  which  occur  are 
referred  to  as  the  formants  of  the  particular  speech  sound. 
For  the  majority  of  male  speakers  the  first  three  formants 
lie in the ranges 150-850 Hz, 500-2500 Hz, and
1700-3500 Hz. Formants for women and children are higher
in  frequency,  due,  in  part,  to  the  smaller  size  of  their 
vocal  mechanisms  (Fant,  1956). 

Some  speech  sounds,  such  as  steady  state  vowels,  may  be 
identified  or  characterized  by  their  formant  frequencies  and 
the  bandwidths  and  levels  of  those  formant  frequencies. 

When  different  speakers  speak  one  of  the  vowels,  the 
utterances  are  different  for  each  speaker.  In  the 
perceptual  space  defined  by  the  frequencies  of  the  formants 
(which  is  referred  to  as  the  formant  space),  these 
differences  manifest  themselves,  for  each  vowel,  as  a 
cluster  of  points  around  some  average  value.  In  the  speech 
literature, the first three formants are used widely for
adequate vowel identification (Peterson, 1952; Peterson &
Barney, 1952; Potter & Steinberg, 1950), although
considerable evidence has been offered in the literature
both for and against the necessity of more formants, formant
bandwidths, and/or formant amplitudes (Peterson, 1952;
Peterson, 1961; Bernstein, 1981; Potter & Steinberg,
1950), and fundamental frequencies of excitation (Foulkes,
1961; Peterson, 1961) for vowel identification. A popular
three-dimensional mechanism for vowel identification is a
plot of the first three formant frequencies F1, F2, F3 in
the perceptual coordinate space defined by the axes F1, F2,
and  F3.  A  base  10  logarithmic  scale  is  usually  used  to 
account  for  the  nonlinearities  of  the  ear.  The  spatial 
position  of  the  articulators  and  the  vocal  mechanism  at  any 
point  in  time  directly  affect  the  frequency  position  of  the 
resonances  of  the  vocal  tract  by  changing  the  relative  sizes 
of  different  parts  of  the  tract. 

The  location  of  a  vowel  in  the  "formant  space"  defined 
by F1 and F2 corresponds to the spatial location of the
tongue  hump  in  the  two-dimensional  representation  of  the 
oral  cavity  for  that  vowel.  In  other  words,  the  formant 
frequency  space  classification  seems  to  have  a  physical 
significance.  Figure  1  shows  the  configuration  of  the 
articulators  for  the  ten  English  vowels  (adapted  from 
Potter, Kopp, & Kopp, 1966, chap. 12) and their
corresponding relative tongue hump positions in the vowel
quadrilateral (adapted from Denes & Pinson, 1963, p. 55).

If the vowels are classified based on these tongue hump
positions (e.g., /i/, as in the word "heed", is a high
front vowel, whereas /ɔ/, as in the word "hawed", is a low
back  vowel),  the  classifications  are  similar  to  those 
obtained  using  the  formant  space  plots.  The  Peterson  and 
Barney  study  is  discussed  and  results  of  graphical  and 
quantitative  analyses  of  the  formant  frequency  data  are 
presented  in  Chapter  II. 

Acoustic  Tube  Vocal  Tract  Models 

There  are  many  applications  where  the  PARCOR 
coefficients  have  a  direct  physical  relationship  to  sound 
generation  mechanisms.  One  such  case  is  the  generation  of 
speech  sounds  by  the  human  vocal  system.  The  vocal  tract  is 
often  modeled  as  an  acoustic  tube  (Dunn,  1961;  Flanagan, 
1972; Markel & Gray, 1976, chap. 4; Wakita, 1973a, 1973b,
1979; Wakita & Gray, 1975) excited by either a voiced or
unvoiced  source  somewhere  along  its  length,  with  appropriate 
boundary  conditions  which  depend  on  the  circumstances  of 
phonation.  The  PARCOR  coefficients  of  the  lattice  structure 
have  a  direct  physical  relationship  to  the  reflections 
between  sections  of  the  acoustic  tube.  Specifically,  the 
area ratio between successive sections is defined
(Wakita, 1979, p. 281) as

$$\frac{A_{j+1}}{A_j} = \frac{1 - K_j}{1 + K_j}, \qquad j = 1, 2, \ldots, p,$$

where the A_j are the areas of the sections of the acoustic
tube and the K_j are the PARCOR coefficients for a pth-order
model.  Wakita  (1973b)  suggested  that  the  area  function  of 
the  acoustic  tube  (determined  from  a  ladder  implementation 
of  the  linear  prediction  technique)  could  be  used  to  detect 
obstructions  in  the  vocal  tract.  If  the  tube  is  considered 
to  be  lossless,  of  length  L,  constant  cross-sectional  area, 
and  closed  at  the  vocal  fold  end  while  open  at  the  lip  end, 
resonances occur at the frequencies for which L = (2n-1)λ/4,
n = 1, 2, ..., where λ is the wavelength at the resonant
frequency.
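
The two relations in this section can be sketched numerically. The first function accumulates section areas from a set of PARCOR coefficients using the area ratio (1 - K_j)/(1 + K_j); the starting area and the direction in which sections are numbered are assumptions, since sign and indexing conventions differ between authors. The second function lists the odd quarter-wavelength resonances of a uniform lossless tube closed at one end and open at the other; the sound speed of 340 m/s and the 0.17 m length are assumed illustrative values.

```python
def area_function(parcor, first_area=1.0):
    """Section areas obtained by chaining A_next = A * (1 - K) / (1 + K).

    first_area and the numbering direction are assumptions of this sketch.
    """
    areas = [first_area]
    for k in parcor:
        areas.append(areas[-1] * (1.0 - k) / (1.0 + k))
    return areas

def uniform_tube_resonances(length_m, count, sound_speed=340.0):
    """Resonances f_n = (2n - 1) c / (4 L) of a closed-open uniform tube."""
    return [(2 * n - 1) * sound_speed / (4.0 * length_m)
            for n in range(1, count + 1)]

areas = area_function([0.5, -0.2, 0.0])
resonances = uniform_tube_resonances(0.17, 3)  # roughly 500, 1500, 2500 Hz
```

A coefficient of zero leaves the area unchanged, which corresponds to no reflection at that junction.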


Autoregressive  Model  for  Vowel  Representation 

The acoustic tube model of the vocal tract lends itself
directly to a mathematical model for a spoken vowel sound.
Non-nasal vowels have been modeled widely in the speech
literature as autoregressive (AR) processes of order p,
generated by passing a white input time series, v(n),
through a pth-order all-pole filter with transfer function

$$H(z) = \frac{1}{A(z)} = \frac{Z(z)}{V(z)},$$

where

$$A(z) = 1 - \sum_{j=1}^{p} a_j z^{-j},$$

and the output process is

$$z(n) = \sum_{j=1}^{p} a_j z(n-j) + v(n).$$


The inverse of a pth-order AR process is a pth-order
moving average (MA) process, generated by passing the input,
v(n), through a pth-order all-zero filter with transfer
function

$$H(z) = B(z) = \frac{Z(z)}{V(z)},$$

where

$$B(z) = 1 - \sum_{k=1}^{p} b_k z^{-k} \quad \text{and} \quad z(n) = v(n) - \sum_{k=1}^{p} b_k v(n-k).$$
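
The relationship between the AR model and its MA inverse can be checked numerically. In the sketch below, a white sequence is passed through an all-pole filter and the result is passed through the all-zero filter with the same coefficients (b_k = a_k), which should return the original excitation. The second-order coefficients, the random seed, and the function names are hypothetical values chosen for illustration.

```python
import random

def ar_synthesize(a, v):
    """All-pole filter: z(n) = sum_j a_j z(n-j) + v(n), zero initial state."""
    z = []
    for n, vn in enumerate(v):
        acc = vn
        for j, aj in enumerate(a, start=1):
            if n - j >= 0:
                acc += aj * z[n - j]
        z.append(acc)
    return z

def ma_inverse(a, z):
    """All-zero inverse filter: e(n) = z(n) - sum_j a_j z(n-j)."""
    e = []
    for n, zn in enumerate(z):
        acc = zn
        for j, aj in enumerate(a, start=1):
            if n - j >= 0:
                acc -= aj * z[n - j]
        e.append(acc)
    return e

rng = random.Random(0)
v = [rng.gauss(0.0, 1.0) for _ in range(200)]   # white excitation
a = [1.2, -0.8]                                  # assumed stable AR(2) model
z = ar_synthesize(a, v)
e = ma_inverse(a, z)
max_error = max(abs(x - y) for x, y in zip(e, v))
```

Because each zero of the inverse filter cancels one pole of the model, the recovered sequence matches the excitation to within rounding error.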


The  least  squares  lattice  filter  used  in  this  study  is 
a  feed-forward  MA  (all-zero)  lattice.  This  inverse  lattice 
filter  produces  a  whitened  output  spectrum  by  placing  a  zero 
at  the  location  of  each  pole  in  the  input  spectrum.  Optimal 
whitening  is  obtained  when  the  following  conditions  are 
satisfied  (because  of  the  one  to  one  correspondence  of  zeros 
to  poles):  1)  AR  input  processes  are  used  as  the  lattice 
input, and 2) the order of the lattice filter is chosen to be
equal  to  the  order  of  the  input  AR  process.  The  fitting  of 
an  AR  model  to  a  time  series  is  equivalent  to  the  method  of 
maximum entropy spectral analysis (Burg, 1967; Macina,
1981; Papoulis, 1981; Parzen, 1974; Ulrych & Bishop,
1975). The maximum entropy technique assumes maximum
uncertainty  with  respect  to  the  unknown  information  about 
the  signal  (outside  the  sampling  interval). 
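
The feed-forward (all-zero) lattice structure can be sketched with fixed, real-valued coefficients. This is a simplification: the lattice used in this study is complex-valued and adapts its coefficients in time, whereas the sketch below only shows the signal flow for given PARCOR values. The stage equations, the sign convention, and the names are assumptions for illustration. Each stage forms a forward error f_i(n) = f_{i-1}(n) - K_i b_{i-1}(n-1) and a backward error b_i(n) = b_{i-1}(n-1) - K_i f_{i-1}(n), with zero initial state (prewindowed data).

```python
def lattice_inverse_filter(parcor, x):
    """Run a fixed-coefficient all-zero lattice over the sequence x.

    Returns a list `outputs` where outputs[i] holds the forward prediction
    error of order i, so every lower-order output is available at once.
    """
    p = len(parcor)
    delayed_b = [0.0] * p                  # b_i(n-1) for i = 0 .. p-1
    outputs = [[] for _ in range(p + 1)]
    for xn in x:
        f = xn                             # f_0(n) = x(n)
        b = xn                             # b_0(n) = x(n)
        outputs[0].append(f)
        for i, k in enumerate(parcor):
            b_prev = delayed_b[i]          # b_i(n-1)
            delayed_b[i] = b               # store b_i(n) for the next sample
            f_next = f - k * b_prev        # f_{i+1}(n)
            b = b_prev - k * f             # b_{i+1}(n)
            f = f_next
            outputs[i + 1].append(f)
    return outputs

x = [1.0, 2.0, -1.0, 0.5, 3.0]
outputs = lattice_inverse_filter([0.5, -0.3], x)
```

For two stages with K_1 = 0.5 and K_2 = -0.3, expanding the stage equations by hand gives the direct form x(n) - 0.65 x(n-1) + 0.3 x(n-2), which the second-order output reproduces.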

Because the aim of this research is to illustrate the
pattern  recognition  capabilities  of  the  PARCOR  coefficients 
for  data  with  known  formant  frequencies  (data  which  had 
previously  been  shown  to  cluster  in  the  frequency  domain), 
the  vowels  are  modeled  as  sixth-order  AR  processes  and  the 
lattice  is  specified  to  be  sixth-order  to  maximize  the 
accuracy  of  the  PARCOR  coefficients.  The  modeling  of  a 
vowel sound as a sixth-order AR process is a gross
oversimplification in terms of speech production. However, it
is  a  necessary  one  if  the  fundamental  intent  of  the  research 
is  to  be  respected.  Generation  of  vowel  utterances  from 
formant  frequencies  is  discussed  in  Chapter  III. 

The  Linear  Prediction  Problem 

The  inverse  filter  is  determined  through  the  use  of  the 
linear  prediction  technique.  This  technique  has  been  used 
widely  in  the  speech  field  as  well  as  in  geophysics  and 
neurophysics (Landers & Lacoss, 1977; Makhoul, 1975; Wood
&  Treitel,  1975)  for  time  series  modeling.  The  linear 
predictive  technique  estimates  the  properties  of  a  signal  by 
modeling  a  sample  as  a  linear  combination  of  past  samples 
and  minimizing  some  form  of  the  error  between  the  actual  and 
predicted  samples.  The  most  common  implementations  of  the 
linear  predictive  technique  have  been  the  autocorrelation 
and  covariance  methods  (Makhoul,  1975;  Markel  &  Gray,  1976, 
chap. 9; Rabiner & Schafer, 1978).
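
For comparison with the lattice approach used in this study, the autocorrelation method can be sketched with the Levinson-Durbin recursion, which produces the predictor coefficients and the PARCOR coefficients together. This is the standard block-processing method, not the least squares lattice; the sign convention (predictor z(n) ≈ sum_j a_j z(n-j)) and the function names are assumptions of this example.

```python
def autocorrelation(x, max_lag):
    """Biased autocorrelation sums r[0..max_lag] of a finite sequence."""
    return [sum(x[n] * x[n - lag] for n in range(lag, len(x)))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the autocorrelation normal equations recursively in order.

    Returns (a, parcor, err): predictor coefficients a_1..a_order, the
    PARCOR coefficient found at each order, and the final error energy.
    """
    a = []
    parcor = []
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j - 1] * r[i - j] for j in range(1, i))
        k = acc / err
        parcor.append(k)
        # Order update: a_j(i) = a_j(i-1) - k * a_{i-j}(i-1), a_i(i) = k.
        a = [a[j - 1] - k * a[i - j - 1] for j in range(1, i)] + [k]
        err *= (1.0 - k * k)
    return a, parcor, err

# Autocorrelation of an ideal first-order process with coefficient 0.5.
a, parcor, err = levinson_durbin([1.0, 0.5, 0.25], 2)
```

The second PARCOR coefficient comes out as zero here, reflecting that a first-order model already accounts for the given autocorrelation values.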

Lattice Method of Linear Prediction

The  linear  prediction  technique  may  also  be  implemented 
as  a  lattice  algorithm,  which  is  recursive  in  time  and 
order.  A  lattice  structure  has  several  advantages  over  the 
autocorrelation  and  covariance  methods.  Most  importantly, 
the  physical  structure  of  the  lattice  filter  is  composed  of 
cascaded  filter  stages;  a  pth-order  lattice  filter  may  be 
decomposed  into  all  filters  of  up  to  and  including  pth- 
order.  The  pth-order  lattice  structure  simultaneously 
generates  outputs  of  all  lesser  order  filters.  Lattice 
models  are  naturally  related  to  physical  models  such  as  the 
scattering  and  propagation  of  waves  in  a  stratified  medium 
(Friedlander, 1980). The PARCOR coefficients of the lattice
are  also  related  to  the  reflections  between  layers  of  the 
medium  being  modeled.  This  type  of  physical  meaning  is  not 
apparent for the polynomial filter coefficients, a_j, which
are  obtained  from  the  PARCOR  coefficients  by  a  nonlinear 
recursion.  The  lattice  model  is  directly  applicable  to  the 
study  of  transmission  theory,  seismic  signal  processing,  and 
underwater  sound  propagation  as  well  as  speech 
communication.  For  instance,  geophysicists  have  used  ladder 
structures  in  their  study  of  structural  features  of  the 
earth's  subsurface  (Burg,  1967;  Robinson  &  Treitel,  1980; 
Wiggins & Robinson, 1965; Ulrych & Bishop, 1975). The
lattice method of linear prediction was first presented by
Itakura & Saito (1971) and is well known in the speech field
as  an  analysis  tool  (Atal  &  Hanauer,  1971;  Flanagan,  1972; 
Makhoul,  1975).  Also,  hardware  implementations  of  lattice 
filters  have  been  successfully  marketed  as  effective 
synthesis  devices  for  the  compression  and  transmission  of 
speech  signals.  The  Texas  Instruments'  "Speak  &  Spell"  game 
is  an  example  of  this  technology.  Researchers  in  the  speech 
field  have  used  the  lattice  based  on  an  acoustic  tube  model 
of the vocal tract (Markel & Gray, 1976, chap. 4; Wakita,
1973a, 1973b; Wakita & Gray, 1975). In addition to the
obvious  physical  significance  of  the  lattice  structure, 
other  advantageous  features  of  the  lattice  method  of  linear 
prediction  include  a  recursive-in-time  implementation 
(rather  than  block  processing),  faster  convergence, 
insensitivity  to  eigenvalue  spread,  better  numerical 
behavior,  robustness,  and  insensitivity  to  roundoff  noise 
(Friedlander, 1982a; Lee, Morf, & Friedlander, 1981;
Makhoul, 1978; Markel & Gray, 1976, chap. 9). Lattice
filters  have  found  application  in  the  fields  of  noise 
cancelling,  channel  equalization,  seismic  signal  processing, 
speech  processing,  system  identification,  frequency 
tracking,  spectral  estimation,  and  spectrum  prewhitening 
(Friedlander, 1982b; Hodgkiss & Alexandrou, 1983; Hodgkiss
& Presley, 1981, 1982; Lee, Morf, & Friedlander, 1981;
Satorius & Alexander, 1979; Satorius & Pack, 1981;
Satorius & Shensa, 1980a).
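
The nonlinear recursion that maps PARCOR coefficients to the polynomial coefficients a_j can be sketched as a step-up recursion. The sign convention assumed here is A(z) = 1 - sum_j a_j z^-j with real-valued coefficients; other texts use the opposite sign, and the function name is hypothetical.

```python
def parcor_to_polynomial(parcor):
    """Step-up recursion from PARCOR coefficients to a_1..a_p.

    At order i: a_i(i) = K_i and a_j(i) = a_j(i-1) - K_i * a_{i-j}(i-1)
    for j = 1 .. i-1. The dependence on products of the K_i is what makes
    the mapping nonlinear.
    """
    a = []
    for i, k in enumerate(parcor, start=1):
        a = [a[j - 1] - k * a[i - j - 1] for j in range(1, i)] + [k]
    return a

coefficients = parcor_to_polynomial([0.5, -0.3, 0.2])
```

Truncating the input list after any stage gives the polynomial of the corresponding lower-order filter, which mirrors the order-recursive property of the lattice.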

The least squares lattice. The least squares lattice
recursions  were  first  obtained  by  an  algebraic  approach 
(Morf,  Lee,  Nickolls  &  Vieira,  1977;  Morf,  Vieira,  &  Lee, 
1977).  The  least  squares  lattice  (LSL)  structures  proposed 
by Morf et al. are more efficient numerically than the
gradient lattice algorithms (Griffiths, 1975, 1977, 1978),
requiring only O(p) operations per time update, where p is
the  order  of  the  filter.  The  least  squares  lattice 
structures  are  known  for  their  good  stability  properties, 
rapid  startup,  excellent  convergence  properties  and  fast 
parameter  tracking  capabilities  (Hodgkiss  &  Presley,  1981; 
Lee, 1980; Lee, Morf, & Friedlander, 1981; Satorius &
Pack, 1981; Satorius & Shensa, 1980b). These advantages
are a direct result of two lattice parameters which account
for the algorithmic differences between the gradient and LSL
formulations: an exponential weighting parameter,
(1 - α_LSL), and a Gaussian likelihood step size parameter,
γ_{i-2}(n-1). These parameters are discussed in Chapter IV.
Certain  assumptions  must  be  made  about  the  behavior  of  the 
waveform  outside  the  time  interval  of  observation.  The 
lattice  structure  used  in  this  research  prewindows  the  data, 
i.e., it assumes that z(n) = 0 for all n < 0. The LSL
algorithms developed by Morf et al. were adapted for complex
data  by  Hodgkiss  and  Presley  (1981,  1982)  and  then 
programmed  in  FORTRAN  by  Alexandrou  and  Hodgkiss  (Note  1). 

A section of this program was incorporated into this

CHAPTER II

VOWEL  IDENTIFICATION 

The  concept  of  formant  space  clustering  is  the  result 
of  work  done  at  Bell  Laboratories  during  the  years 
1947-1951.  A  study  of  sustained  vowels  was  undertaken  by 
G.  E.  Peterson,  H.  L.  Barney,  R.  K.  Potter,  J.  C.  Steinberg 
and  others  to  investigate  the  relationship  between  spoken 
and  perceived  vowels  and  their  acoustical  correlates 
(Barney,  Note  2;  Peterson  &  Barney,  1952;  Potter  & 
Steinberg,  1950).  The  vowels  used  were  the  ten  English 
vowels  in  a  consonant-vowel-consonant  (CVC)  context  with  /h/ 
as  the  first  consonant.  For  the  Bell  Laboratories  studies, 
a  total  of  76  speakers  including  33  men,  28  women,  and  15 
children  each  recorded  two  lists  of  10  words  (each  word 
contained  a  vowel  in  the  context  /h_d/).  The  vowels  and 
corresponding  symbols  and  words  adapted  from  those  of  the 
International  Phonetic  Alphabet  (IPA)  are  presented  in 
Table  1.  The  majority  of  the  male  speakers  spoke  General 
American  English.  The  words  were  recorded  and  played  to  a 
group  of  70  listeners  who  identified  the  vowel  in  each  of 
the  words.  The  vowel  portion  of  each  CVC  utterance  was  also 
analyzed  with  a  sound  spectrograph  to  determine  the 
frequency  positions  of  the  first  three  formants.  Since  the 
results  of  the  listening  tests  showed  the  effects  of  the 
diverse  dialectal  backgrounds  of  the  listeners,  a  sub-group 


Table 1

Symbol   Data points   CVC syllable   Utterance
/i/      65            /hid/          heed
/I/      51            /hId/          hid
/ε/      35            /hεd/          head
/æ/      56            /hæd/          had
/ɑ/      37            /hɑd/          hod
/ɔ/      45            /hɔd/          hawed
/U/      55            /hUd/          hood
/u/      60            /hud/          who'd
/ʌ/      37            /hʌd/          hud
/ɝ/      65            /hɝd/          heard

Symbols are those of the International Phonetic Alphabet

of 26 observers with similar characteristics was chosen
(Barney,  Note  2).  Of  the  1520  vowels  presented,  1203  were 
identified  unanimously  by  the  26  observers. 

When  the  first  and  second  formants  of  the  vowels  are 
plotted  against  each  other,  the  vowels  appear  in  essentially 
the  same  positions  as  they  do  in  the  vowel  quadrilateral 
(Peterson  &  Barney,  1952).  Vowels  may  be  separated  on  the 
basis  of  their  locations  in  the  space  defined  by  the  first 
two formants, except for the vowel /ɝ/, identified by its
third  formant,  which  is  lower  than  that  of  the  other  vowels 
(Potter  &  Steinberg,  1950).  The  distribution  is  continuous 
in  the  F1-F2  plane  in  going  from  vowel  to  vowel;  the 
overlap between vowels is characteristic of the differences
in  the  way  various  individuals  articulate  and  pronounce  the 
vowels  (Peterson  &  Barney,  1952).  The  distributions  for 
each  vowel  tend  to  be  elongated,  elliptical  areas  along 
lines  which  pass  through  the  origin,  indicating  that 
although  formant  ratios  are  not  exactly  constant,  they  do 
tend  to  be  helpful  for  the  identification  of  some  vowels 
(Potter  &  Steinberg,  1950). 

Analysis  of  the  Formant  Frequency  Data 

Fundamental  and  formant  frequency  data  for  these  1203 
utterances  were  obtained  through  the  courtesy  of  Bell 
Laboratories Archives for use in this study. Because Potter
and Steinberg (1950) determined that formant frequency
positions  for  a  man's  voice  differed  from  those  of  a  woman's 
for  the  same  vowel,  the  present  author  has  restricted  this 
study  to  data  from  male  utterances  to  control  for 
fundamental  frequency.  This  author  has  plotted  the  vowel 
data  used  for  this  study  in  the  manner  of  Peterson  and 
Barney.  Data  for  four  of  the  utterances  are  eliminated  from 
the  study  because  they  occurred  outside  the  limits  of  three 
standard  deviations  for  a  particular  vowel.  The  remaining 
number  of  data  points  per  vowel  are  listed  in  Table  1.  The 
base  10  logarithms  of  formant  frequencies  are  normalized  so 
that  the  frequency  range  of  the  entire  set  of  data  falls 
within  the  interval  [0,1].  Table  2  presents  the  ranges  of 
normalized  formant  frequencies.  The  widest  range  is  spanned 
by F2, the second widest by F1. This is consistent with the
recognition  of  F2  in  the  literature  (Potter,  Kopp,  &  Kopp, 
1966,  pp.  74-75)  as  a  primary  feature  of  voiced  speech, 
especially the movement of F2 in identifying diphthongs (as
in  "how",  "hoe",  "hay",  "high",  "hoist"). 
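
The normalization described above can be sketched as follows, using the un-normalized formant frequency limits of 190 and 3400 reported in Table 2 as the ends of the interval. The function and constant names are hypothetical, and the treatment of the limits as a single global range for all three formants is an assumption based on the description in the text.

```python
import math

F_MIN_HZ = 190.0    # lowest formant frequency in the data set (Table 2)
F_MAX_HZ = 3400.0   # highest formant frequency in the data set (Table 2)

def normalize_log_frequency(f_hz, f_min=F_MIN_HZ, f_max=F_MAX_HZ):
    """Map a frequency onto [0, 1] on a base 10 logarithmic scale."""
    low = math.log10(f_min)
    high = math.log10(f_max)
    return (math.log10(f_hz) - low) / (high - low)

lowest = normalize_log_frequency(190.0)
highest = normalize_log_frequency(3400.0)
middle = normalize_log_frequency(1000.0)
```

Because one range is shared by all three formants, distances along the F1, F2, and F3 axes remain directly comparable after normalization.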


Graphical Analysis


Figures  2  and  3  present  the  vowel  clusters  for  all  ten 
vowels  in  the  F1-F2  and  F1-F3  planes.  The  same  clusters  are 
presented  separately  by  vowel  in  Figures  4-13  for  the  F1-F2 
plane, and the cluster for the vowel /ɝ/ in the F1-F3 plane
is shown in Figure 14. Each data point appears as the

Table 2

Normalized and Un-normalized Ranges for
Formant Frequencies and PARCOR Coefficients

Cluster variable               Minimum    Maximum

Un-normalized values
  Formant frequencies (1-3)    190        3400
  PARCOR coefficients (1-6)    -0.9674    0.9971

Values normalized to [0,1]
  Formant frequencies
    log F1                     0.0000     0.5235
    log F2                     0.3747     0.9201
    log F3                     0.6924     1.000
  PARCOR coefficients
    K1                         0.0000     0.6484
    K2                         0.3020     1.000
    K3                         0.0320     0.4346
    K4                         0.3071     0.9425
    K5                         0.2027     0.6200
    K6                         0.8858     0.9303
Figure 3. Clustering of the ten English vowels in the
F1-F3 plane.
Figure 6. Clustering of the vowel /ε/ in the F1-F2 plane.
Figure 8. Clustering of the vowel /ɑ/ in the F1-F2 plane.
Figure 9. Clustering of the vowel /ɔ/ in the F1-F2 plane.
Figure 11. Clustering of the vowel /u/ in the F1-F2 plane.
Figure 12. Clustering of the vowel /ʌ/ in the F1-F2 plane.
Figure 13. Clustering of the vowel /ɝ/ in the F1-F2 plane.


symbol  for  the  particular  vowel  which  it  represents.  The 
symbols  are  somewhat  modified  versions  of  those  of  the 
International  Phonetic  Alphabet,  due  to  the  limited 
availability  of  IPA  symbols  for  plotting  purposes.  The 
results mentioned previously as reported by Peterson,
Barney, Potter, and Steinberg are reproduced and verified by
the  present  author.  Each  vowel  cluster  on  the  formant  plots 
is  enclosed  by  a  solid  line.  The  exact  outline  of  each 
cluster  is  arbitrary;  the  outlines  are  intended  to  indicate 
a  general  cluster  shape  for  the  purpose  of  evaluating 
separability  in  a  graphical,  qualitative  manner. 

Distance  measures 

Selected  quantitative  distance  measures  and  cluster 
sizes  are  computed  for  the  formant  frequencies  in  two 
(F1,F2) and three (F1,F2,F3) dimensions (see Appendix A).
Although the actual distance measures (Tou & Gonzalez, 1974,
p.  77)  used  are  arbitrary,  a  set  of  dimensionless 
measurements  is  necessary  to  allow  comparison  of  the  cluster 
sizes  and  vowel  separability  in  the  formant  space  with  that 
in  the  PARCOR  space.  The  average  intracluster  distance  for 
each  cluster  is  computed  as  the  average  Euclidean  distance 
between  each  of  the  data  points  in  the  cluster  (normalized 
log  frequency)  and  the  centroid  of  the  cluster.  The 
intercluster  distance  is  computed  as  the  Euclidean  distance 
between the centroids of selected adjacent vowel cluster
pairs. Ideally, the average intracluster distances should
be  minimized  for  maximum  cluster  compactness  and  the 
intercluster  distances  should  be  maximized  for  maximum 
cluster  separability.  The  average  intracluster  distances 
for  each  cluster  and  the  intercluster  distances  for  selected 
adjacent  vowel  pairs  are  tabulated  in  Tables  3  and  4, 
respectively.  The  average  intracluster  distance,  a  measure 
of  cluster  compactness,  for  each  vowel  is  minimum  in  two 
(F1,F2)  dimensions,  whereas  the  intercluster  distance,  a 
measure  of  vowel  separation,  is  maximum  in  three  (Fl,F2,F3) 
dimensions.  The  ratio  of  the  sum  of  average  intracluster 
distances  to  intercluster  distance  for  each  of  the  adjacent- 
vowel  pairs  is  also  computed  in  two  and  three  dimensions, 
(presented  in  Table  5).  This  measurement  is  meaningful  when 
compared  with  values  computed  for  the  PARCOR  coefficients 
(Chapter  5).  The  distance  measures  substantiate  the  results 
of  the  graphical  analysis:  sufficiency  of  the  F1-F2  plot  to 
identify  the  first  nine  vowels  and  the  F1-F3  plot  to 
separate  them  from  the  tenth  vowel,  /T/. 
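The three measures described above can be sketched in a few lines. This is only an illustrative sketch: the function names and the sample points are hypothetical, and the exact normalization used in Appendix A is not reproduced here. The last line recomputes one ratio from tabulated values (the i-I pair in the F1-F2 plane) as a consistency check.

```python
import math

def centroid(points):
    """Mean of a list of equal-length coordinate tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def euclidean(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def avg_intracluster_distance(points):
    """Average distance from each point to the cluster centroid."""
    c = centroid(points)
    return sum(euclidean(p, c) for p in points) / len(points)

def intercluster_distance(cluster_a, cluster_b):
    """Distance between the centroids of two clusters."""
    return euclidean(centroid(cluster_a), centroid(cluster_b))

def compactness_ratio(cluster_a, cluster_b):
    """Sum of average intracluster distances over intercluster distance."""
    total = (avg_intracluster_distance(cluster_a)
             + avg_intracluster_distance(cluster_b))
    return total / intercluster_distance(cluster_a, cluster_b)

# Hypothetical normalized (F1, F2) points for two vowel clusters.
cluster_a = [(0.10, 0.80), (0.12, 0.84), (0.08, 0.82), (0.10, 0.78)]
cluster_b = [(0.30, 0.70), (0.32, 0.72), (0.28, 0.68), (0.30, 0.74)]
ratio = compactness_ratio(cluster_a, cluster_b)

# Consistency check with tabulated values for the pair i-I in F(1,2):
# intracluster 0.0456 and 0.0424, intercluster 0.1408.
table_check = (0.0456 + 0.0424) / 0.1408
```

A ratio well below one indicates clusters that are compact relative to their separation; the tabulated check evaluates to about 0.625.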


Table 3

                     Cluster dimension

Vowel    F(1,2)    F(1,2,3)   K(1,2)    K(1,2,3)   K(1-6)

/i/      0.0456    0.0514     0.1303    0.1418     0.1623
/I/      0.0424    0.0464     0.0848    0.0926     0.1069
/ɛ/      0.0337    0.0375     0.0627    0.0715     0.0834
/æ/      0.0312    0.0360     0.0583    0.0699     0.0830
/a/      0.0263    0.0401     0.0155    0.0650     0.1027
/ɔ/      0.0496    0.0577     0.0116    0.0408     0.0920
/U/      0.0461    0.0533     0.0235    0.0393     0.0737
/u/      0.0775    0.0815     0.0299    0.0463     0.0797
/A/      0.0365    0.0443     0.0264    0.0527     0.0774
/ɝ/      0.0426    0.0504     0.0448    0.0373     0.0567
Table 4

Intercluster Distances for Adjacent-Vowel Pairs of
Formant Frequency and PARCOR Coefficient Clusters

                        Cluster Dimension

Vowel pair   F(1,2)    F(1,2,3)   K(1,2)    K(1,2,3)   K(1-6)

i-I          0.1408    0.1482     0.1828    0.1832     0.2510
I-ɛ          0.1143    0.1147     0.0809    0.0917     0.1083
ɛ-æ          0.0852    0.0861     0.0968    0.1067     0.1142
æ-ɝ          0.1388    0.1824     0.1288    0.1842     0.2921
a-ɔ          0.1616    0.1619     0.0676    0.1207     0.1372
a-A          0.0648    0.0653     0.0428    0.0577     0.0625
a-ɝ          0.1598    0.2021     0.0881    0.2291     0.3407
ɔ-U          0.1179    0.1200     0.0539    0.0572     0.1071
ɔ-A          0.1537    0.1537     0.0884    0.1075     0.1364
U-u          0.1376    0.1377     0.0229    0.0542     0.0774
U-ɝ          0.1060    0.1400     0.1155    0.1479     0.2502
A-ɝ          0.0951    0.1495     0.0590    0.1826     0.2936
Table 5

Ratio of the Sum of Average Intracluster Distances to
Intercluster Distance for Adjacent-Vowel Pairs of
Formant Frequency and PARCOR Coefficient Clusters

                        Cluster Dimension

Vowel pair   F(1,2)    F(1,2,3)   K(1,2)    K(1,2,3)   K(1-6)

i-I          0.6249    0.6599     1.176     1.279      1.072
I-ɛ          0.6661    0.7322     1.824     1.789      1.757
ɛ-æ          0.7624    0.8542     1.251     1.326      1.457
æ-ɝ          0.5312    0.4739     0.7431    0.6223     0.4783
a-ɔ          0.4694    0.6038     0.4005    0.8761     1.419
a-A          0.9695    1.293      0.9804    2.040      2.879
a-ɝ          0.4309    0.4479     0.5996    0.4788     0.4679
ɔ-U          0.8118    0.9247     0.6503    1.401      1.546
ɔ-A          0.5600    0.6637     0.4300    0.8688     1.241
U-u          0.8982    0.9793     1.780     1.579      1.980
U-ɝ          0.8369    0.7408     0.5264    0.5685     0.5209
A-ɝ          0.8311    0.6338     1.081     0.5336     0.4565
CHAPTER III

GENERATION OF SYNTHESIZED VOWEL-LIKE SOUNDS

It was desired to reproduce the data measured at Bell
Laboratories as accurately as possible. The speech spectrum
may be adequately represented by frequency information below
4000 Hz. (Denes & Pinson, 1963, p. 140), which includes the
range of the first three formant frequencies for male
speakers. Since each formant must be represented by a
complex conjugate pole pair, a six-pole filter is a
sufficient representation of a vowel. A sampling frequency
of 8000 Hz. is chosen, following Rabiner and Schafer (1978,
chap. 3).

Digital  Models  for  the  Vocal  Tract 

Rabiner and Schafer (1978, chap. 3) represent the vocal
tract as a recursive IIR digital filter with transfer
function

$$H(z) = \frac{1}{1 - \sum_{j=1}^{p} a_j z^{-j}} = \frac{1}{A(z)} = \frac{Z(z)}{V(z)} \tag{1}$$

where p is the order of the filter. For stability, the
z-plane poles corresponding to the roots of the denominator
A(z) must lie inside the unit circle. The corresponding
output AR speech process, z(n), is described by the equation:

$$z(n) = a_1 z(n-1) + a_2 z(n-2) + \cdots + a_p z(n-p) + v(n) = \sum_{j=1}^{p} a_j z(n-j) + v(n) \tag{2}$$
Each formant of a vowel is related to a complex conjugate
pair of zeros of the polynomial in z^{-1}, A(z):

$$z_j,\; z_j^{*} = e^{-\pi \Delta F_j T \,\pm\, j 2\pi F_j T}$$

where F_j is the jth formant frequency, ΔF_j is the
bandwidth of the jth formant, and T is the sampling period
(Rabiner & Schafer, 1978). So that the filter may be
realized recursively, the polynomial coefficients are
determined by evaluating the denominator of H(z), where

$$H(z) = \frac{1}{\prod_{j=1}^{p} \left(1 - z_j z^{-1}\right)}$$

and equating the denominator to

$$A(z) = 1 - \sum_{j=1}^{p} a_j z^{-j}.$$
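As an illustration of the procedure just described, the sketch below builds the polynomial A(z) from three formant frequencies and bandwidths by multiplying out one conjugate zero pair per formant. It is a minimal sketch under stated assumptions, not the software used in this study: the function name is hypothetical, the magnitude of each zero is taken as exp(-pi * bandwidth * T), and the returned list holds the coefficients of z^0, z^-1, ..., z^-p directly, so under the convention A(z) = 1 - sum a_j z^-j the a_j would be the negatives of the returned entries after the first.

```python
import cmath
import math

def formants_to_poly(formants_hz, bandwidths_hz, fs_hz):
    """Coefficients c[0..p] of A(z) = sum_k c[k] z^-k, with c[0] = 1,
    built from one complex conjugate zero pair per formant."""
    T = 1.0 / fs_hz
    coeffs = [1.0 + 0j]
    for F, dF in zip(formants_hz, bandwidths_hz):
        zero = cmath.exp(complex(-math.pi * dF * T, 2.0 * math.pi * F * T))
        for root in (zero, zero.conjugate()):
            # Multiply the running polynomial by (1 - root * z^-1).
            nxt = coeffs + [0j]
            for k in range(1, len(nxt)):
                nxt[k] -= root * coeffs[k - 1]
            coeffs = nxt
    # Conjugate pairs make the imaginary parts vanish up to rounding.
    return [c.real for c in coeffs]

# Formants and bandwidths of Example 1 (vowel /i/), sampled at 8000 Hz.
poly = formants_to_poly([244.0, 2300.0, 2780.0], [52.0, 66.0, 120.0], 8000.0)
```

With these inputs the first and last coefficients come out near -0.372 and 0.8295, which appears to agree with the coefficient row listed for Example 1 in Table 6 if that row gives the polynomial coefficients directly.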


Physical  Model  of  Speech  Production 


The physical mechanism for speech production consists
of the lungs, bronchi, trachea, larynx, pharynx, and nasal
and oral cavities. The larynx, which includes the vocal
folds, is the principal structure for voiced speech.
Complex tones are produced when short-duration air pulses
generated at the glottis (the space between the vocal folds)
excite the supralaryngeal portion of the vocal tract.
Alternatively, a noise source may be produced by constricting
the vocal tract (i.e., at the vocal folds, lips, tongue, or
soft palate), causing the airstream to become turbulent.
For unvoiced speech, the noise source is produced without
the vibration of the vocal folds. Here, too, the vocal
tract acts as a resonant cavity to shape the resultant
sound. Figure 15 (from Flanagan, 1972, p. 24) shows the
physical system in terms of the possible mechanisms for
sound generation and resonation.

The resonant frequencies of the lossless tube model of
the vocal tract have very narrow bandwidths. Actually, the
vocal tract is not lossless. The cross section of the vocal
tract varies continuously over the length, and energy losses
occur as a result of viscous friction between air and the
walls of the tube, heat conduction through the tube walls,
and vibration of the tube walls, as well as from losses at
the glottis (vocal folds) and lips. These losses are
frequency-dependent, and their combined effect is to change
the positions of the vocal tract resonances and broaden the
bandwidths of those resonances (Rabiner & Schafer, 1978,
p. 72).




Formant  Bandwidths 


The bandwidths of the Peterson and Barney data were
measured by Bogert (1953) and again by Dunn (1961). Bogert
concluded that bandwidths of formants are invariant and
independent of vowel. Dunn also questioned whether changes
in bandwidth from vowel to vowel are critical or even
necessary for correct identification of synthetic speech.
Neither the bandwidth data measured by Dunn nor those
measured by Bogert were available for use in this study.
The required bandwidth values for each formant frequency are
supplied by averages over all ten vowels (ΔF1 = 52.0 Hz.,
ΔF2 = 66.0 Hz., ΔF3 = 120.0 Hz.) from sine wave bandwidths
for synthesized vowels determined by Dunn with his electrical
vocal tract.

Figure 15. Schematic diagram of functional components of
the vocal tract. (From Flanagan, 1972, p. 24)

A series of experiments conducted by this author
describes the effects of bandwidth changes on the PARCOR
coefficients of a single formant (two-pole) system. For a
series of single formant frequency systems with formants
incremented over the range 250-3500 Hz., when the bandwidth
of the formant is incremented from 10 to 200 Hz., the range
(averaged over the formant frequency experiments) over which
K1 varies is .0057 with a standard deviation of .0034. The
average range over which K2 varies is .0115 with a standard
deviation of .1361. Likewise, for a series of
constant-bandwidth single-formant systems, when the formants
are incremented over the range 250-3500 Hz. for each
bandwidth in the range 10-200 Hz., the range (averaged over
the bandwidth experiments) over which K1 varies is 1.900 with
a standard deviation of .0040; the average range of K2 is
.0061 with a standard deviation of .0255. In comparing the
ranges of K1 and K2, it is observed that the effect of
formant frequency variations on the PARCOR coefficients,
especially K1 (which has been previously determined by
Tohkura & Itakura, 1979, to be sensitive to variations in
pole placement), is much more pronounced than the effect of
bandwidth variations, which is practically negligible.
Incidentally, it is noted that as ΔFj is increased for any
Fj, K2 decreases, and as Fj is increased for any ΔFj, K1
increases.
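The qualitative behavior reported above can be checked against the closed-form reflection coefficients of an ideal two-pole system. The sketch below is an assumption-laden illustration rather than a reproduction of the experiments: it uses theoretical values instead of lattice estimates, writes the polynomial as A(z) = 1 + c1 z^-1 + c2 z^-2, and adopts the sign convention K2 = c2, K1 = c1 / (1 + c2). The function name and the sweep grids are hypothetical.

```python
import math

def two_pole_parcor(formant_hz, bandwidth_hz, fs_hz=8000.0):
    """Theoretical (K1, K2) of a single-formant (two-pole) system."""
    T = 1.0 / fs_hz
    r = math.exp(-math.pi * bandwidth_hz * T)       # pole radius
    theta = 2.0 * math.pi * formant_hz * T          # pole angle
    c1, c2 = -2.0 * r * math.cos(theta), r * r      # A(z) coefficients
    k2 = c2                                         # last step-down stage
    k1 = c1 / (1.0 + c2)                            # first-order coefficient
    return k1, k2

formants = list(range(250, 3501, 250))
bandwidths = list(range(10, 201, 10))

# Sweep the formant at a fixed 52 Hz bandwidth.
k1_vs_formant = [two_pole_parcor(F, 52.0)[0] for F in formants]
k1_range_formant = max(k1_vs_formant) - min(k1_vs_formant)

# Sweep the bandwidth at a fixed 1000 Hz formant.
k1_vs_bandwidth = [two_pole_parcor(1000.0, dF)[0] for dF in bandwidths]
k2_vs_bandwidth = [two_pole_parcor(1000.0, dF)[1] for dF in bandwidths]
k1_range_bandwidth = max(k1_vs_bandwidth) - min(k1_vs_bandwidth)
```

Under these assumptions K1 sweeps a range of roughly 1.9 as the formant moves from 250 to 3500 Hz., while it moves by only a few thousandths over the bandwidth sweep, and K2 falls monotonically as the bandwidth grows.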

Modeling  of  Vocal  Tract  Excitation 

Modeling of the vocal tract excitation function
presents a problem with respect to the generation of the
synthesized vowel-like sounds. The primary objective in
reproducing the Bell Laboratories data is to accurately
reproduce the formant frequencies measured by Peterson and
Barney. Input to the vowel generation prefilter must be
white in order to obtain an AR process as the output.
Ideally, the easiest way to exactly specify the frequency
peaks (formants) of the output speech spectrum is to specify
them as the peaks of the transfer function of the filter,
using an input signal with a constant (flat) frequency
spectrum. This requirement of a flat spectrum is satisfied
by two types of signals: white noise and a periodic
(deterministic) impulse train (corresponding to unvoiced and
voiced excitation, respectively). Although the frequency
spectrum of an impulse train is flat, modeling of the
excitation as an impulse train is less than desirable in
terms of the accuracy with which the harmonics of the
impulse frequency correspond to the desired frequency peaks
(formants) of the output signal.

The  speech  spectrum  for  a  vowel  should  theoretically 
have  formant  frequencies  which  are  integer  multiples  of  the 
fundamental  frequency.  However,  the  fundamental  and  formant 
frequencies  measured  by  Peterson  and  Barney  do  not  exhibit 
this  tendency  for  several  reasons.  Primarily,  the  technique 
used  by  Peterson  and  Barney  to  measure  formant  frequency 
used  a  weighted  average  of  the  frequency  components  of  the 
spectral  peak  (Potter  &  Steinberg,  1950);  in  addition, 
factors  such  as  perturbation  in  actual  fundamental  frequency 
and  difficulty  in  interpreting  spectrograms  to  within  a  few 
Hertz,  as  well  as  measurement  and  roundoff  error  probably 
contributed  to  the  lack  of  relationship  between  measured 
fundamentals  and  their  formants.  Since  the  measured 
fundamentals  are  fairly  low  in  frequency,  their  harmonics 
are  sufficiently  far  apart  as  to  cause  the  reproduced 
formants to be much different from the measured (desired)
formants.

As stated earlier, the primary objective in the
generation of the data is to recreate the measured formants.
Therefore, the fundamental frequency data are disregarded,
and the input excitation is chosen to be a white noise
sequence with zero mean and standard deviation equal to one.
This may be likened to an unvoiced input, as though
whispered vowels are being generated. Although the power
spectral density of the excitation function of a whispered
vowel is not exactly white, the approximation of the input
as white noise is no less accurate from a signal processing
standpoint than approximating the glottal wave (voiced
input) with an impulse train.

The input excitation problem is illustrated in
Figure 16. A voiced input is shown in Figure 16(b) as a
periodic, deterministic impulse train of frequency f0 with
harmonics fn, where fn = n f0. An unvoiced input is modeled
in Figure 16(d) as white noise, with frequency components at
fn (limited by the sampling frequency Fs), where fn = n/Fs.
Comparing the output spectra of Figure 16(c) and
Figure 16(e), the reproduced formants F̂n are closer to the
desired formants Fn for the noise input of Figure 16(d)
than for the impulse input of Figure 16(b).

Each synthesized vowel-like utterance is generated as a
sixth-order AR time series from the three formant
frequencies supplied by the Bell Laboratories data (Barney,
1952). A Gaussian white noise process, v(n), with σ_v² = 1.0
is used as the excitation function. Bandwidths for F1, F2,
and F3 are held constant for all utterances at the values
52.0 Hz., 66.0 Hz., and 120.0 Hz., respectively. The
process is generated for 1000 samples. These samples are
then fed into the sixth-order inverse whitening filter (see
Chapter IV) to obtain the PARCOR coefficients which are used
as pattern recognition parameters for vowel identification.
A system block diagram of the computer simulation is shown
in Figure 17.
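The generation step can be sketched as an all-pole recursion driven by unit-variance Gaussian white noise. This is a minimal sketch, not the simulation software itself: the function name and the random seed are hypothetical, and the coefficient list is taken from the coefficient row of Example 1 in Table 6 under the assumption that it gives the coefficients of A(z) as the polynomial 1 + c1 z^-1 + ... + c6 z^-6.

```python
import random

def synthesize_ar(poly, n_samples, seed=0):
    """Pass unit-variance Gaussian white noise through 1/A(z), where
    A(z) = sum_k poly[k] z^-k and poly[0] == 1. Returns (z, v)."""
    rng = random.Random(seed)
    p = len(poly) - 1
    z, v = [], []
    for n in range(n_samples):
        vn = rng.gauss(0.0, 1.0)
        acc = vn
        for k in range(1, p + 1):
            if n - k >= 0:
                acc -= poly[k] * z[n - k]   # feedback of past outputs
        z.append(acc)
        v.append(vn)
    return z, v

# Assumed polynomial coefficients for Example 1 (vowel /i/).
poly_i = [1.0, -0.3717, 0.3330, -1.592, 0.3281, -0.2648, 0.8295]
z, v = synthesize_ar(poly_i, 1000)
```

The 1000 output samples z(n) play the role of one synthesized utterance; the excitation v(n) is kept only so that a whitening filter can later be checked against it.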
CHAPTER IV


INVERSE FILTER

The forward PARCOR coefficients K_i^e(n),
i = 1, 2, ..., 6, corresponding to each utterance are
obtained by passing 1000 of the synthesized vowel-like
samples, z(n), through a sixth-order inverse (whitening)
filter. The six final forward PARCOR coefficients K_i^e(1000)
are used in Chapter 5 as pattern recognition parameters for
the vowel utterances.


The  Linear  Prediction  Problem 

Rabiner and Schafer (1978, chap. 8) present the linear
prediction problem: for the digital vocal tract model of
equation 1, the speech samples, z(n), are related to the
white input samples, v(n), by equation 2. A linear
predictor with predictor coefficients, â_j, is defined as

$$\hat{z}(n) = \sum_{j=1}^{p} \hat{a}_j z(n-j).$$

The system function for this pth-order linear predictor is

$$P(z) = \sum_{j=1}^{p} \hat{a}_j z^{-j}.$$

The prediction error, e(n), is the difference between the
speech sample, z(n), and the linearly predicted one, ẑ(n):

$$e(n) = z(n) - \hat{z}(n) = z(n) - \sum_{j=1}^{p} \hat{a}_j z(n-j).$$

The error, e(n), is the output of a system with transfer
function

$$\frac{1}{\hat{H}(z)} = \hat{A}(z) = 1 - \sum_{j=1}^{p} \hat{a}_j z^{-j},$$

which recovers the white input (e(n) = v(n)) by removing the
correlation between samples of z(n). The error, e(n), will
approach v(n) as the â_j approach the a_j, and Â(z) will be
the inverse filter for H(z). The linear prediction problem
then is one of finding the Â(z) which minimizes the square
of the exponentially weighted forward prediction error given
by

$$E_p(N) = \sum_{n=0}^{N} (1 - \alpha_{CLSL})^{N-n} \, |e_p(n)|^2$$

(Hodgkiss & Presley, 1981). This leads to a set of linear
equations called the normal equations. Complete algebraic
derivations of the least squares lattice equations (from
which this was adapted) are found in Lee (1980) and Pack and
Satorius (Note 4).
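The whitening property can be illustrated with a small numerical sketch. It assumes a hypothetical stable second-order model (the coefficients 1.2 and -0.8 are not taken from this study), synthesizes z(n) from white noise, and then forms the prediction error with the exact coefficients and with no prediction at all. The exponentially weighted error energy is evaluated with the weighting defined above; all function names are hypothetical.

```python
import random

def ar_synthesize(a, v):
    """z(n) = sum_j a[j-1] * z(n-j) + v(n), zero initial conditions."""
    z = []
    for n, vn in enumerate(v):
        z.append(vn + sum(a[j - 1] * z[n - j]
                          for j in range(1, len(a) + 1) if n - j >= 0))
    return z

def prediction_error(a_hat, z):
    """e(n) = z(n) - sum_j a_hat[j-1] * z(n-j)."""
    return [z[n] - sum(a_hat[j - 1] * z[n - j]
                       for j in range(1, len(a_hat) + 1) if n - j >= 0)
            for n in range(len(z))]

def weighted_error_energy(e, alpha):
    """E(N) = sum_n (1 - alpha)^(N - n) * e(n)^2, with N = len(e) - 1."""
    N = len(e) - 1
    return sum((1.0 - alpha) ** (N - n) * e[n] ** 2 for n in range(N + 1))

rng = random.Random(1)
a_true = [1.2, -0.8]                         # hypothetical stable AR(2) model
v = [rng.gauss(0.0, 1.0) for _ in range(200)]
z = ar_synthesize(a_true, v)

e_exact = prediction_error(a_true, z)        # predictor equals the model
e_none = prediction_error([0.0, 0.0], z)     # no prediction: e(n) = z(n)
energy_exact = weighted_error_energy(e_exact, 0.0001)
energy_none = weighted_error_energy(e_none, 0.0001)
```

When the predictor coefficients equal the model coefficients, the error sequence reproduces v(n) to rounding precision, and its weighted energy is much smaller than that of the unpredicted signal.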

The  Least  Squares  Lattice 

The solution of the normal equations is dependent on
the efficient inversion of a p x p covariance matrix (Lee,
1980). Several solution algorithms are discussed by Morf,
Lee, Nickolls, and Vieira (1977). The Levinson (1947)
algorithm is an efficient least squares simultaneous
solution to the normal equations requiring only O(p²)
computations per time update (where p is the order of the
filter) for a stationary process. A natural implementation
of Levinson's algorithm, the lattice structure (as realized
by Itakura and Saito, 1971), provides an extension to the
non-stationary case. Lee (1980) and Pack and Satorius (Note
4) present Levinson's recursion clearly. A class of fast
exact least squares algorithms which require only O(p)
computations per time update is discussed in the literature
by Morf, Dickinson, Kailath, and Vieira (1977), and Morf,
Lee, Nickolls, and Vieira (1977). An exact time update
recursion in terms of lattice variables only has been
developed and tested by Lee (1980), Morf and Lee (1978),
Morf, Lee, Nickolls, and Vieira (1977), and Morf, Vieira,
and Lee (1977).

Lattice  Structure 

A feed-forward (MA) lattice structure (Figure 18) is
the realization of Levinson's algorithm for the computation
of the optimal linear predictor. The lattice is composed of
a cascade of p individual lattice sections, corresponding to
the stages (order) of the algorithm, i = 1, 2, ..., p. The
variable in the lower path, e_i(n), is the forward error
between the input, z(n), and the least squares (linearly
predicted) estimate of z(n), ẑ(n), based on a linear
combination of past inputs:

$$\hat{z}(n) = \sum_{j=1}^{p} \hat{a}_j z(n-j).$$

Likewise, the backward prediction error, r_i(n), propagates
backward along the upper path. The variables represented by
the cross bars of the lattice are the forward and backward
partial correlation or PARCOR coefficients, which arise
naturally as intermediate entities in the solution of the
Levinson algorithm. For the least squares lattice,
K_i^e ≠ K_i^r and |K_i^e|, |K_i^r| < 1 for i = 1, 2, ..., p.
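The relationship between a lattice and the equivalent direct-form filter can be sketched for the simpler case of fixed coefficients. The sketch below is not the least squares lattice of this chapter: it assumes time-invariant coefficients with equal forward and backward values (K_i^e = K_i^r = K_i), and a sign convention in which e_i(n) = e_{i-1}(n) + K_i r_{i-1}(n-1) and r_i(n) = r_{i-1}(n-1) + K_i e_{i-1}(n). The function names are hypothetical, and the coefficient values are borrowed from Example 1 of Table 6 purely as sample inputs.

```python
import random

def step_up(parcor):
    """Polynomial coefficients c[0..p] of A(z) = sum_k c[k] z^-k obtained
    from reflection coefficients by the order-update recursion."""
    poly = [1.0]
    for i, k in enumerate(parcor, start=1):
        prev = poly + [0.0]
        poly = [prev[m] + k * prev[i - m] for m in range(i + 1)]
    return poly

def lattice_forward_error(parcor, z):
    """Final-stage forward error of a fixed-coefficient MA lattice."""
    p = len(parcor)
    r_delayed = [0.0] * p          # r_delayed[i] holds r_i(n-1)
    out = []
    for zn in z:
        e, r = zn, zn              # order-zero errors equal the input
        for i in range(p):
            e_next = e + parcor[i] * r_delayed[i]
            r_next = r_delayed[i] + parcor[i] * e
            r_delayed[i] = r       # store r_i(n) for the next time step
            e, r = e_next, r_next
        out.append(e)
    return out

def direct_form_error(poly, z):
    """e(n) = sum_k poly[k] * z(n-k) with zero initial conditions."""
    return [sum(poly[k] * z[n - k] for k in range(len(poly)) if n - k >= 0)
            for n in range(len(z))]

parcor = [-0.2498, 0.0314, -0.8441, 0.1920, 0.0860, 0.8125]  # sample values
rng = random.Random(2)
z = [rng.gauss(0.0, 1.0) for _ in range(300)]
e_lattice = lattice_forward_error(parcor, z)
e_direct = direct_form_error(step_up(parcor), z)
```

Under these assumptions both structures produce the same error sequence, which is why the coefficients on the cross bars carry the same information as the predictor coefficients.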
Lattice Variables

Appendix  B  presents  the  least  squares  complex  adaptive 
lattice  variables  and  equations  as  adapted  from  Lee,  Morf, 
and  Friedlander  (1981)  for  complex  data  by  Hodgkiss  and 
Presley  (1982).  The  software  supplied  by  Alexandrou  and 
Hodgkiss  (Note  1)  for  the  computer  simulation  is  based  on 
these  equations. 


Fade factor. The fade factor, (1 - α_CLSL), applies an
exponential weighting on the data by weighting recent errors
more heavily than those in the distant past. For this study
the value of α_CLSL was chosen to be .0001, although the
choice is not critical here as the time series are all
stationary. Bounded by [0,1], (1 - α_CLSL) is usually close
to 1; the inverse of α_CLSL is approximately the memory of
the algorithm (Pack & Satorius, Note 4). The value of α_CLSL
may be selected to satisfy a misadjustment criterion
(Hodgkiss & Presley, 1982).
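The statement that the inverse of the fade parameter approximates the memory of the algorithm can be illustrated by summing the exponential weights. This is a small sketch with a hypothetical function name; the second parameter value (.01) is chosen only for contrast and is not used in this study.

```python
def exponential_weights(alpha, n_samples):
    """Weights (1 - alpha)^(N - n) for n = 0..N, with N = n_samples - 1."""
    N = n_samples - 1
    return [(1.0 - alpha) ** (N - n) for n in range(n_samples)]

# Value used in this study: the memory (about 10000 samples) exceeds the
# 1000-sample record, so even the oldest sample keeps most of its weight.
weights_study = exponential_weights(0.0001, 1000)
oldest_weight = weights_study[0]

# Hypothetical faster fade: the weights sum to about 1 / alpha = 100.
memory_fast = sum(exponential_weights(0.01, 5000))
```

With the value .0001 the oldest of the 1000 samples is still weighted by about 0.90, which is consistent with the remark that the choice is not critical for stationary data.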


Likelihood variable. A major difference between the
gradient lattice developed by Griffiths (1975, 1977) and LSL
algorithms is the Gaussian likelihood parameter,
γ_{i-2}(n-1), which replaces the constant step size of the
gradient lattice and is responsible for the fast tracking
capabilities of the LSL. For likely samples (the lower
bound of γ_{i-2}(n-1) = 0 is reached for α_CLSL = 0), the
step size is small and constant, roughly on the order of
magnitude of the "optimal" gradient step size. For unlikely
samples (the upper bound of γ_{i-2}(n-1) = 1 is reached for
α_CLSL = 1), γ_{i-2}(n-1) will approach unity; the gain,
1/(1 - γ_{i-2}(n-1)), is very large, causing the lattice
parameters to change quickly to adapt to sudden changes in
the input data (Hodgkiss & Presley, 1981). The values of
both α_CLSL and γ_{i-2}(n-1) will become critical if the
study is extended to examine the time-varying behavior of
the PARCOR coefficients for a nonstationary input time
series.

Partial correlation (PARCOR) coefficients. The
variable Δ_i(n) is known as the ith-order partial
autocorrelation between z(n) and z(n-i-1), and is defined as
the correlation between these two samples after removing
their mutual linear dependence on intervening samples. The
partial correlation (PARCOR) coefficients K_i^e and K_i^r
are the partial autocorrelations normalized by E_{i-1}^e(n)
and E_{i-1}^r(n-1).

Performance  Measures  for  the  Lattice 

Various performance measures may be employed to
evaluate the whitening properties of the lattice and the
accuracy with which the predictor coefficients, â_j,
identify the transfer function of the prefilter, assuming a
white input to the prefilter. Two example cases are
presented in Table 6, for the vowels /i/ and /u/. The
pth-stage mean square error (Hodgkiss & Presley, 1982),
E[|e_p(n)|²], may be plotted at each time n. As a
quantitative measure of convergence time, the 10% settling
time may be determined as the time at which the mean square
error comes within 10% of the theoretical steady state
value. The mean square error for Example 1 is plotted in
Figure 19. The final (pth-stage) error power (Hodgkiss &
Presley, 1982) after convergence,
E[|e_p(n)|²] = Ê[|v(n)|²] = σ_e², will be an estimate of the
variance of the prefilter input signal,
E[|v(n)|²] = σ_v². For zero-mean Gaussian white noise input
with σ_v² = 1, σ_e² should approach one for a true whitening
filter. The misadjustment, |(σ_e² - σ_v²)/σ_v²|, is also a
popular performance measure. The misadjustment after 1000
samples is 8% and 9% for Example 1 and Example 2,
respectively.
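The misadjustment figures quoted above follow directly from the final mean square errors listed in Table 6 (0.9178 and 0.9119) with unit input variance. The sketch below recomputes them, converts the error powers to decibels, and shows one possible reading of the 10% settling time as the first time index at which the error enters the tolerance band. The function names and the short error curve used to exercise the settling-time function are hypothetical.

```python
import math

def misadjustment(sigma_e2, sigma_v2=1.0):
    """Relative deviation of the error power from the input variance."""
    return abs((sigma_e2 - sigma_v2) / sigma_v2)

def power_db(power):
    """Power expressed in decibels relative to unity."""
    return 10.0 * math.log10(power)

def settling_time(mse, steady_state, tolerance=0.10):
    """First time index at which the MSE lies within the tolerance band
    around the steady state value, or None if it never does."""
    for n, value in enumerate(mse):
        if abs(value - steady_state) <= tolerance * steady_state:
            return n
    return None

final_mse = {"Example 1": 0.9178, "Example 2": 0.9119}   # from Table 6
report = {name: (misadjustment(m), power_db(m))
          for name, m in final_mse.items()}

# Hypothetical decaying error curve, only to exercise settling_time.
toy_curve = [4.0, 2.0, 1.5, 1.08, 1.0]
toy_settling = settling_time(toy_curve, 1.0)
```

The two misadjustments evaluate to roughly 0.082 and 0.088, which round to the quoted 8% and 9%; because both error powers are below one, the decibel values are negative.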

When the filter transfer functions are realized from
the filter coefficients, the plot of the transfer function
is a good qualitative performance measure. Inverted, the
magnitude of the transfer function of the lattice filter,
|Â(z)|, is an approximation of the prefilter (vocal tract
model) transfer function, |1/A(z)|. This series of transfer
functions is presented in Figure 20 for Example 1 and in
Figure 21 for Example 2.

The power spectral density may also be used as another
qualitative performance measure. For a prefilter input
which is zero-mean Gaussian white noise, the power spectral
density of the prefilter input is proportional to σ_v², a
constant:

$$P_v(z) = 10 \log C \sigma_v^2 .$$

The prefilter output (lattice input) power spectral density
is given by Papoulis (1981) as

$$P_z(z) = 10 \log \frac{C \sigma_v^2}{|A(z)|^2} .$$

An estimate of the prefilter input power spectral density is
the power spectral density of the lattice output,

$$P_e(z) = 10 \log P_z(z) \, |\hat{A}(z)|^2 .$$

It should be flat (constant) for optimal whitening. The
power spectral density series for the two examples are shown
in Figures 22 and 23. An estimate of the prefilter output
power spectral density is given by Griffiths (1975) as

$$\hat{P}_z(z) = 10 \log \frac{C \sigma_e^2}{|\hat{A}(z)|^2} .$$


Table 6

Selected Filter Parameters for
Two Example Vowel Utterances

j              1         2         3         4         5         6

Example 1, vowel /i/

F_j (Hz.)      244       2300      2780      -         -         -
ΔF_j (Hz.)     52        66        120       -         -         -
a_j            -0.3717   0.3330    -1.592    0.3281    -0.2648   0.8295
â_j(N)         -0.3604   0.3449    -1.585    0.3332    -0.2635   0.8145
K_j^e(N)       -0.2498   0.0314    -0.8441   0.1920    0.0860    0.8125
K_j            0.3653    0.5084    0.0628    0.5902    0.5362    0.9060

Final mean square error E[|e_6(1000)|²] = 0.9178 = -0.3723 dB
10% settling time = 757

Example 2, vowel /u/

F_j (Hz.)      340       950       2240      -         -         -
ΔF_j (Hz.)     52        66        120       -         -         -
a_j            -2.964    4.337     -4.541    3.978     -2.558    0.8295
â_j(N)         -2.919    4.224     -4.381    3.817     -2.459    0.8077
K_j^e(N)       -0.9255   0.8691    -0.6304   0.4167    -0.2982   0.8006
K_j            0.0214    0.9349    0.1716    0.7046    0.3407    0.9000

Final mean square error E[|e_6(1000)|²] = 0.9119 = -0.4003 dB
10% settling time = 766


Figure 20. Transfer functions for Example 1, Table 6.
(a) Prefilter transfer function, |H(z)|. (b) Lattice filter
transfer function, 1/|Ĥ(z)|. (c) Inverted lattice filter
transfer function, |Ĥ(z)|, is an estimate of the prefilter
transfer function H(z).
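The estimated output spectrum can be evaluated numerically by computing the polynomial on the unit circle. The sketch below is illustrative only: the function name is hypothetical, the constant C is set to one, the logarithm is taken to base ten, and the estimated coefficient row of Example 1 in Table 6 is assumed to give the coefficients of the polynomial 1 + c1 z^-1 + ... + c6 z^-6 directly.

```python
import cmath
import math

def ar_psd_db(poly, sigma2, freq_hz, fs_hz=8000.0, c=1.0):
    """10 log10( c * sigma2 / |A(e^{jw})|^2 ) at one frequency, where
    A(z) = sum_k poly[k] z^-k."""
    w = 2.0 * math.pi * freq_hz / fs_hz
    a_w = sum(ck * cmath.exp(-1j * w * k) for k, ck in enumerate(poly))
    return 10.0 * math.log10(c * sigma2 / abs(a_w) ** 2)

# Assumed estimated polynomial and final error power for Example 1.
poly_hat = [1.0, -0.3604, 0.3449, -1.585, 0.3332, -0.2635, 0.8145]
sigma_e2 = 0.9178

# Sample the estimated spectrum from 0 to 4000 Hz in 50 Hz steps.
spectrum = [(f, ar_psd_db(poly_hat, sigma_e2, float(f)))
            for f in range(0, 4001, 50)]
near_f1 = ar_psd_db(poly_hat, sigma_e2, 244.0)
valley = ar_psd_db(poly_hat, sigma_e2, 1200.0)
```

Under these assumptions the estimated spectrum is considerably higher near the first formant (244 Hz.) than in the valley between the first and second formants, as expected for a filter that models the vowel resonances.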
CHAPTER 5

RESULTS

For each synthesized utterance, the least squares
lattice computes a set of PARCOR coefficients at each time
update. The six forward PARCOR coefficients at the last
time update, K_i^e(N) = K_i^e(1000), obtained as the output
of the lattice, are the classification parameters in this
study. Because the PARCOR coefficients span the range
[-0.9674, 0.9971], they are normalized to span the interval
[0,1] to facilitate comparisons of distance measures and
cluster sizes between formant clusters and PARCOR clusters.
Henceforth, the notation K1, K2, ..., K6 will denote
normalized K_i^e(1000), i = 1, 2, ..., 6. Table 2 lists the
normalized ranges of all the PARCOR coefficients.
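The text does not spell out the normalization formula, so the sketch below assumes a single min-max scaling over the overall range [-0.9674, 0.9971] shared by all six coefficients. The function name is hypothetical. As a check on that assumption, the final forward coefficients of Example 1 in Table 6 are mapped and compared with the normalized row of the same table.

```python
K_MIN, K_MAX = -0.9674, 0.9971   # overall range of the PARCOR coefficients

def normalize_parcor(k):
    """Map a coefficient from [K_MIN, K_MAX] onto [0, 1] (assumed form)."""
    return (k - K_MIN) / (K_MAX - K_MIN)

raw = [-0.2498, 0.0314, -0.8441, 0.1920, 0.0860, 0.8125]       # Table 6
tabulated = [0.3653, 0.5084, 0.0628, 0.5902, 0.5362, 0.9060]   # Table 6
normalized = [normalize_parcor(k) for k in raw]
max_deviation = max(abs(n - t) for n, t in zip(normalized, tabulated))
```

The mapped values agree with the tabulated ones to four decimal places, which suggests that this global scaling is the one applied, although that remains an inference from the table.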

Analysis  of  the  PARCOR  Coefficient  Data 

The PARCOR coefficients for all of the synthesized
utterances are analyzed both graphically and quantitatively
in the same manner as the formant frequencies were analyzed.
Distance measures are computed in two (K1,K2), three
(K1,K2,K3), and six (K1-K6) dimensions in a manner analogous
to the computation of the two (F1,F2) and three (F1,F2,F3)
dimensional formant frequency distance measures. The data
for the synthesized vowel-like sounds are shown in the
K1-K2, K1-K3, K1-K4, K1-K5, and K1-K6 planes in
Figures 24-28.

Graphical  Representation 

From a graphical analysis, vowels are most separable in
the K1-K2 plane, except for the vowel /ɝ/, for which K3 is
quite low relative to that of the other vowels and which may
be differentiated by its location in the three-dimensional
space defined by K1, K2, and K3. Figures 29-38 present the
synthesized vowel-like sounds, singly, in the K1-K2 plane.
The vowel /ɝ/ in the K1-K3 plane is shown in Figure 39.
Graphically, none of the other PARCOR coefficients further
separates the vowels. As for the formant space vowel
clusters, the precise vowel cluster areas enclosed on the
PARCOR plots are arbitrary, intended to indicate a general
cluster shape for the purpose of evaluating separability in
a graphical, qualitative manner. The range of K2 is
greatest, followed by that of K1 (as for the formant
frequencies). K1 includes the highest value of PARCOR
coefficient; K2 includes the lowest. The range of K6 is
the smallest of the PARCOR ranges. The ranges spanned by
the various PARCOR coefficients are in accordance with the
results of Tohkura and Itakura (1979), who noted that the
spectral sensitivity for the first PARCOR is often quite
high, and its distribution is wider than that of the higher
order PARCOR coefficients.
Figure 24. Clustering of the ten English vowels in the
K1-K2 plane.

Figure 25. Clustering of the ten English vowels in the
K1-K3 plane.

Figure 26. Clustering of the ten English vowels in the
K1-K4 plane.

Figure 27. Clustering of the ten English vowels in the
K1-K5 plane.

Figure 33. Clustering of the vowel /a/ in the K1-K2 plane.

Figure 37. Clustering of the vowel /A/ in the K1-K2 plane.
Distance Measures

The  same  selected  quantitative  measures  of  the  cluster 
sizes  and  relationships  which  are  computed  for  formant 
frequency  clusters  are  also  computed  in  two,  three,  and  six 
dimensions  for  the  PARCOR  coefficient  clusters.  Table  3 
(p.  35)  presents  the  average  intracluster  distance  for  each 
vowel,  for  two,  three,  and  six-dimensional  cases.  As  for 
the  formant  frequencies,  intracluster  distances  are  minimum 
(indicating  cluster  compactness)  for  every  vowel  in  two 
dimensions.  Table  4  (p.  36)  presents  the  intercluster 
distances  for  selected  adjacent  vowel  pairs  for  the  two, 
three,  and  six-dimensional  cases.  As  for  the  formant 
frequencies,  intercluster  distances  are  maximum  (indicating 
separability) in the largest number of dimensions, which,
for the PARCOR coefficients, is the six-dimensional case.
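
The measures referred to above are defined in Appendix A. As a minimal sketch of how they could be computed, assuming each vowel cluster is stored as a NumPy array with one pattern vector per row (the function names and the sample points are illustrative and are not taken from the thesis):

```python
import numpy as np

def centroid(cluster):
    """Mean vector of a cluster given as an (N_j, p) array of pattern vectors."""
    return np.asarray(cluster, dtype=float).mean(axis=0)

def average_intracluster_distance(cluster):
    """Mean Euclidean distance from each pattern vector to the cluster centroid."""
    cluster = np.asarray(cluster, dtype=float)
    return np.linalg.norm(cluster - centroid(cluster), axis=1).mean()

def intercluster_distance(cluster_i, cluster_j):
    """Euclidean distance between the centroids of two clusters."""
    return np.linalg.norm(centroid(cluster_i) - centroid(cluster_j))

# Hypothetical (K1, K2) points for two vowels; not data from the thesis.
vowel_a = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
vowel_b = np.array([[4.0, 5.0], [6.0, 5.0]])

intra_a = average_intracluster_distance(vowel_a)   # every point lies sqrt(2) from (1, 1)
inter_ab = intercluster_distance(vowel_a, vowel_b) # centroids (1, 1) and (5, 5)
print(intra_a, inter_ab)
```

Because a Euclidean distance cannot decrease when further coordinates are appended to the same points, both measures are non-decreasing in the number of dimensions. This is consistent with the trends reported in Tables 3 and 4.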

The  efficiency  with  which  the  PARCOR  coefficients 
represent  the  vowel  clusters  may  be  compared  to  that 
exhibited  by  the  formant  frequencies  by  studying  Tables  3 
and  4  (pp.  35-36).  The  average  intracluster  distance 
(Table  3)  should  be  minimized  and  the  intercluster  distances 
(Table  4)  should  be  maximized. 

As a combined measure of compactness and separability,
the ratio of the sum of the average intracluster distances
to the intercluster distance is computed for each
adjacent-vowel pair, for formants and PARCOR coefficients,
in two, three, and six dimensions. This parameter should be
minimized for clusters which are both separable and compact;
it is greater than one for clusters which are less so.
Quantitatively, analysis of this parameter (Table 5, p. 37)
indicates a smaller ratio for PARCOR coefficients than for
formants for five of the twelve adjacent-vowel pairs. In
other words, the PARCOR coefficient clusters are roughly
equivalent to the formant frequency clusters in terms of
their compactness and separability. The actual number of
dimensions in which the smaller ratios are obtained varies
over the vowel pairs.
The  use  of  this  ratio  must  be  coupled  with  a  qualitative 
assessment  of  the  vowel  clusters.  For  instance,  although 
the  vowels  /i/  and  /I/  in  the  K1-K2  plane  in  Figures  29  and 
30  are  both  widely  dispersed,  inspection  of  Figure  24 
reveals  that  the  vowel-like  sounds  may  be  separated  in  the 
K1-K2  plane.  The  combined  ratio  for  the  pair  i-I,  however, 
suffers  because  the  coefficients  are  so  widely  dispersed. 
Inspection of the analogous figures (2, 4, and 5) for the
formant  frequencies  suggests  that  the  formant  space 
representation  is  about  equivalent  to  the  PARCOR  coefficient 
representation,  yet  the  combined  ratio  in  the  formant  space 
is  smaller  due  to  the  more  compact  nature  of  the  clusters. 
The  elliptical  shapes  of  the  clusters  also  contribute  to  the 
inaccuracy  of  this  type  of  measurement,  since  it  is  more 
suited  to  clusters  which  are  symmetric. 
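
The combined measure discussed in this section can be written compactly as follows. The symbol R_ij is introduced here only for illustration; the average intracluster distances and the intercluster distance are the quantities defined in Appendix A, and the numerical values are hypothetical.

```latex
R_{ij} = \frac{\bar{D}_i + \bar{D}_j}{D_{ij}},
\qquad \text{e.g.}\quad
\bar{D}_i = 0.10,\ \bar{D}_j = 0.20,\ D_{ij} = 0.60
\ \Longrightarrow\ R_{ij} = 0.5 .
```

A value below one indicates that the centroids are farther apart than the sum of the two average cluster radii. Since each radius is an average over all directions, an elongated cluster inflates the numerator even when its long axis does not point toward the neighboring cluster.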

An item-by-item comparison between the columns in each
table is meaningful, specifically the two-dimensional
columns in Table 3 and the F(1,2,3) and K(1-6) columns in
Table 4. This method of evaluating the two systems of
classifying the vowels is more meaningful and informative in
conjunction with a graphical assessment, since the results
are not consistently in favor of one system or the other.

The  vowel  clusters  are  not  consistently  more  compact  or  well 
separated  in  one  domain  than  in  the  other.  In  other  words, 
the  PARCOR  coefficient  representation  of  the  vowels  is  about 
equivalent  to  the  formant  representation. 

It  is  possible  that  taking  the  logarithm  of  the  PARCOR 
coefficients  would  cause  them  to  cluster  more  compactly, 
since  the  clusters  have  an  elliptical  shape.  This  is  not 
done,  however,  because  there  is  no  physical  justification 
for  the  transformation  (such  as  the  nonlinearity  of  the  ear 
in  the  case  of  the  formant  frequencies). 

Various distance measures have been employed by
researchers in the speech field to assess the similarity
between two utterances. These distance measures are
commonly computed from the linear predictor coefficients,
a_i (Atal & Rabiner, 1976; Gray & Markel, 1976; Levinson,
Rabiner, Rosenberg, & Wilpon, 1979; Tribolet, Rabiner, &
Sondhi, 1979). The physical significance of the PARCOR
coefficients lends credibility to their use as an
alternative vehicle for assessing similarity between
utterances. A quantitative comparison between the various
measurements is warranted, based on the results presented
here, which indicate that the PARCOR coefficients may be
equivalent to or better than one or more of the more widely
used feature parameters.


CHAPTER  VI 


SUMMARY  AND  CONCLUSIONS 

It was shown by Potter and Steinberg (1950) and by
Peterson and Barney (1952) that the vowel formant
frequencies F1 and F2 tend to cluster by vowel when plotted
for different speakers. Formant frequency data measured by
these researchers for utterances by male speakers were
obtained and analyzed by this author quantitatively as well
as graphically. The results obtained by the Bell
Laboratories researchers (Peterson & Barney, 1952; Potter &
Steinberg, 1950) are verified: the first nine vowel
clusters are well defined in the space spanned by F1 and F2,
whereas the third formant is necessary for identification of
the tenth vowel, /ɝ/. The average intracluster distance is
smaller for every vowel when computed in two dimensions
rather than three, indicating that the clusters are most
compact in two dimensions. Intercluster distances between
adjacent vowel pairs were computed as a measure of vowel
separability and found to be maximum in three dimensions
for all of the pairs.

Each of the utterances was then reproduced from the
formant frequencies as an autoregressive time series by a
six-pole IIR recursive digital filter. These time series
were then inverse filtered with a six-zero complex adaptive
lattice filter adapted from Alexandrou & Hodgkiss (Note 1),
which yielded a whitened output signal. The partial
correlation (PARCOR) coefficients from this lattice filter
were shown to cluster by vowel in the space defined by these
coefficients.
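
The processing chain summarized above can be illustrated with a short sketch. It is not the procedure used in this study: the adaptive complex lattice is replaced here by the block Levinson-Durbin recursion on the sample autocorrelation, and the white-noise excitation, sampling rate, formant values, bandwidths, and function names are all assumptions made for illustration.

```python
import numpy as np

def formants_to_ar(formants_hz, bandwidths_hz, fs):
    """All-pole denominator A(z) with one conjugate pole pair per formant."""
    a = np.array([1.0])
    for f, bw in zip(formants_hz, bandwidths_hz):
        radius = np.exp(-np.pi * bw / fs)
        theta = 2.0 * np.pi * f / fs
        a = np.convolve(a, [1.0, -2.0 * radius * np.cos(theta), radius ** 2])
    return a

def synthesize(a, n_samples, rng):
    """Drive the recursive filter 1/A(z) with white Gaussian noise."""
    order = len(a) - 1
    x = np.zeros(n_samples + order)          # leading zeros act as initial state
    w = rng.standard_normal(n_samples)
    for n in range(n_samples):
        past = x[n:n + order][::-1]          # most recent sample first
        x[n + order] = w[n] - np.dot(a[1:], past)
    return x[order:]

def parcor_levinson(x, order):
    """Reflection (PARCOR) coefficients by the autocorrelation method."""
    n = len(x)
    r = np.array([np.dot(x[:n - k], x[k:]) for k in range(order + 1)]) / n
    a = np.array([1.0])
    err = r[0]
    ks = []
    for i in range(1, order + 1):
        k = -np.dot(a, r[i:0:-1]) / err      # sign convention: e_i = e_{i-1} + k r_{i-1}
        a_ext = np.append(a, 0.0)
        a = a_ext + k * a_ext[::-1]
        err *= 1.0 - k * k
        ks.append(k)
    return np.array(ks), a

fs = 10000.0                                 # assumed sampling rate (Hz)
formants = [730.0, 1090.0, 2440.0]           # illustrative formant frequencies (Hz)
bandwidths = [60.0, 90.0, 120.0]             # assumed formant bandwidths (Hz)
a_true = formants_to_ar(formants, bandwidths, fs)
x = synthesize(a_true, 20000, np.random.default_rng(0))
ks, a_est = parcor_levinson(x, order=6)
print(np.round(ks, 3))
```

With this sign convention the last reflection coefficient of an exactly all-pole model equals the last coefficient of A(z), which gives a simple check on the estimate.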

Graphically, the first two coefficients, K1 and K2, are
sufficient to identify the first nine vowels, whereas the
third PARCOR coefficient, K3, is necessary to distinguish
the tenth vowel, /ɝ/, from the other nine. The results of a
numerical analysis of the PARCOR coefficients were analogous
to those found for the formant frequencies. From
calculations of average intracluster distance for each
vowel, it was determined that the clusters are most compact
in two dimensions (K1, K2). Calculations of intercluster
distance between adjacent-vowel pairs show maximum cluster
separation in the largest number of dimensions (six). The
ratio of the sum of intracluster distances to intercluster
distance for each of the adjacent-vowel pairs indicates that
the PARCOR coefficient representation is as effective as or
better than the formant frequency representation for five of
the twelve adjacent-vowel pairs. It is apparent, then, that
the use of the PARCOR coefficients for identification of
synthesized steady state vowel-like sounds is as effective
as identification via formant frequencies. The PARCOR
coefficient technique for the identification of steady state
synthesized vowel-like sounds is a much quicker and more
computationally efficient method than that involving
computation of poles and zeros and back-calculation of
formant frequencies and bandwidths. This is very important
in real-time identification of non-stationary signals.
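
To make the comparison concrete, the following sketch shows the extra step that the formant route requires once predictor coefficients are available: the roots of A(z) must be found and converted to frequencies and bandwidths. It uses the standard pole-to-formant mapping; the sampling rate, the single-resonance polynomial, and the function name are assumptions for illustration only.

```python
import numpy as np

def ar_to_formants(a, fs):
    """Formant frequencies and bandwidths (Hz) from the roots of A(z)."""
    poles = np.roots(a)
    poles = poles[poles.imag > 0]                 # keep one pole per conjugate pair
    freqs = np.angle(poles) * fs / (2.0 * np.pi)  # pole angle -> frequency
    bws = -np.log(np.abs(poles)) * fs / np.pi     # pole radius -> bandwidth
    order = np.argsort(freqs)
    return freqs[order], bws[order]

fs = 10000.0                                      # assumed sampling rate (Hz)
# Hypothetical A(z) with a single resonance at 500 Hz and 80 Hz bandwidth.
radius = np.exp(-np.pi * 80.0 / fs)
theta = 2.0 * np.pi * 500.0 / fs
a = np.array([1.0, -2.0 * radius * np.cos(theta), radius ** 2])

freqs, bws = ar_to_formants(a, fs)
print(freqs, bws)                                 # close to 500 Hz and 80 Hz
```

The PARCOR coefficients, by contrast, are available directly from the lattice stages, so no polynomial root finding is needed.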

Limitations  of  the  Study 

The  least  squares  lattice  is  an  optimal  whitening 
filter  for  an  AR  process  when  the  order  of  the  lattice 
(number  of  zeros)  is  equal  to  the  order  of  the  AR  process 
(number  of  poles).  For  this  study  this  is  the  case,  as  it 
is  desired  to  obtain  the  PARCOR  coefficients  for  inputs  of 
known  order.  However,  for  an  input  signal  whose  origin  is 
not  known,  the  performance  of  the  filter  will  depend  highly 
on  the  order  which  is  selected. 

Another simplification made in this study for the
purpose of exactly matching the input and output transfer
functions is the modeling of the speech signal as an AR
process. This is a commonly used representation in the
literature, although it is extremely simplified. The
assumption of an AR process may only be made for non-nasal
sounds because the coupling of nasal cavities during
production of nasalized sounds adds an antiresonance, or
zero, to the speech spectrum (Denes & Pinson, 1963). The
process may then no longer be accurately modeled as
all-pole. However, researchers have commonly used an
all-pole representation of higher order (Atal & Hanauer,
1971; Friedlander, 1982; Kay & Marple, 1981; Rabiner &
Schafer, 1978) for this purpose as an approximation to a
more desirable pole-zero (ARMA) model because AR models are
much easier to use.
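
The distinction between the two models, and the reason a higher-order all-pole model can stand in for a zero, may be sketched as follows. The gain G, the numerator coefficients c_l, the numerator order q, and the zero location ζ are notation introduced here for illustration.

```latex
H_{\mathrm{AR}}(z) = \frac{G}{1 + \sum_{k=1}^{p} a_k z^{-k}},
\qquad
H_{\mathrm{ARMA}}(z) = \frac{G \left( 1 + \sum_{l=1}^{q} c_l z^{-l} \right)}
                            {1 + \sum_{k=1}^{p} a_k z^{-k}},
\\[1ex]
1 - \zeta z^{-1}
  = \frac{1}{1 + \zeta z^{-1} + \zeta^{2} z^{-2} + \zeta^{3} z^{-3} + \cdots},
\qquad \lvert \zeta \rvert < \lvert z \rvert .
```

A single zero inside the unit circle is therefore equivalent to an infinite number of poles, and truncating the series gives an all-pole approximation whose accuracy improves with order. The closer the zero lies to the unit circle, the more terms are needed.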

The most severe limitation of the study is the fact
that synthesized speech-like sounds (rather than actual
speech sounds) are used. Although synthesized sounds are
used intentionally for the specific purpose of establishing
the PARCOR coefficients as equivalent to formant frequencies
as pattern recognition features, further studies need to
concentrate efforts on identifying actual spoken speech.
Actual speech cannot be accurately modeled as AR (even
steady-state vowel sounds) because the spectrum will contain
extra poles and zeros which are contributed by the following
factors: higher formant frequencies, lip radiation, actual
(not flat) excitation spectrum, aperiodicity of the
excitation function, damping of the vocal tract, any
laryngeal pathologies, measurement difficulty and error,
transmission loss between lips and microphone, and
inaccuracies in the mathematical speech production model
(Dunn, 1961; Fant, 1956, 1959, 1963; Peterson, 1959;
Rabiner & Schafer, 1978, chap. 3). An ARMA lattice is
appropriate in the case of actual speech in order to more
accurately estimate the spectrum. Friedlander and Mitra
(1981) have used an ARMA lattice for the identification of
actual spoken nasal sounds; the results compared favorably
with those obtained by using a high order AR lattice. This
is also discussed by Fallside and Brooks (1976), Green
(1976), and Markel and Gray (1976). ARMA lattice algorithms
are presented by Lee, Friedlander, and Morf (1980), Morf,
Lee, Nickolls, and Vieira (1977), and Morf, Vieira, and Lee
(1977).

Suggestions  for  Future  Research 

Very few speech sounds are steady state, and those that
are remain so only very briefly. Use of the adaptive
capability of the lattice filter, with appropriate selection
of the fade factor (1 - α_CLSL), would enable the results of
vowel identification studies to be extended to simplify the
identification of more complex time-varying sounds.
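
The role of the fade factor can be seen by unrolling the zeroth-stage power update given in Appendix B from its initial value. The derivation below is a sketch added for illustration and assumes a constant α_CLSL between zero and one.

```latex
E_0^{e}(n) = (1-\alpha_{\mathrm{CLSL}})^{n+1}\,\epsilon_{\mathrm{CLSL}}
  + \sum_{k=0}^{n} (1-\alpha_{\mathrm{CLSL}})^{k}\,\lvert x(n-k) \rvert^{2},
\qquad
\sum_{k=0}^{\infty} (1-\alpha_{\mathrm{CLSL}})^{k} = \frac{1}{\alpha_{\mathrm{CLSL}}} .
```

Past samples are thus weighted geometrically, and the filter has an effective memory of roughly 1/α_CLSL samples; α_CLSL = 0.01, for example, corresponds to about 100 samples. A larger α_CLSL tracks time-varying sounds more quickly at the cost of noisier coefficient estimates.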

Diphthongs are commonly identified by researchers in the
speech field (Potter, Kopp, & Kopp, 1961) by the time-varying
paths of their second formant frequencies between formant
locations for the two composite vowel sounds. It seems
reasonable that this could be transformed into an
identification via time-varying PARCOR coefficient(s).

Turner  (1982)  has  used  the  time-varying  behavior  of  the 
PARCOR  coefficients  to  identify  stop  consonants,  which  are 
characterized  by  an  even  more  complicated  time-varying 
frequency spectrum. It is likely that, with extensions to a
higher order AR or ARMA lattice, the technique of using the
PARCOR  coefficients  as  pattern  recognition  features  would  be 
well  suited  to  a  multitude  of  applications  in  the  field  of 
signal  processing.  Whether  used  to  identify  stationary 
signals  or  to  adaptively  identify  and  track  any  type  of 
acoustic  signal,  the  PARCOR  coefficients  of  the  complex 
adaptive  least  squares  lattice  conveniently  and  efficiently 
represent  time  domain  signals.  In  a  purely  pattern 
recognition  context,  the  PARCOR  coefficients  are  valuable 
pattern  recognition  features  in  situations  where  the 
frequency  spectrum  or  pole  locations  are  meaningless. 

Deller and Anderson (1980) identified types of laryngeal
pathologies  by  looking  at  clusters  of  z-plane  pole  locations 
in  and  on  the  unit  circle.  The  actual  pole  locations  have  a 
complicated  relationship  to  the  actual  pathology;  in  their 
case,  all  that  was  needed  was  a  clustering  parameter  to 
identify  outlying  points  and  types  of  clusters. 

The  identification  of  synthesized  steady  state 
vowel-like  sounds  is  a  first  step  in  the  process  of  speech 
identification.  The  clustering  properties  of  the  PARCOR 
coefficients  which  are  demonstrated  in  this  research  for  the 
purpose  of  vowel  identification  show  the  PARCOR  coefficients 
to  be  an  effective  and  efficient  vehicle  for  the 
representation  and  transmission  of  frequency  spectra 
information. It is hoped that these results will inspire
other researchers to extend the study to enable the
simplification of other more complex system identification
problems.

APPENDIX A

Selected Measures of Vowel Cluster Size and
Vowel Cluster Separability^a

Given $N_c$ clusters of $p$-dimensional pattern vectors
$x_{kj}$, where $x_{kj}^{n}$, $n = 1, 2, \ldots, p$, is the nth
component of that vector and $N_j$ is the number of points in
the jth cluster, the centroid vector $\bar{z}_j$ of the jth
cluster is

$$\bar{z}_j = \frac{1}{N_j} \sum_{k=1}^{N_j} x_{kj}.$$

The intercluster distance between the ith and jth
cluster is the Euclidean distance between the centroids of
the clusters:

$$D_{ij} = \lVert \bar{z}_i - \bar{z}_j \rVert
= \Bigl( \sum_{n=1}^{p} \lvert \bar{z}_i^{n} - \bar{z}_j^{n} \rvert^{2} \Bigr)^{1/2}
= \bigl[ (\bar{z}_i - \bar{z}_j)^{T} (\bar{z}_i - \bar{z}_j) \bigr]^{1/2}.$$

The intracluster distance is the Euclidean distance
from the kth vector in the jth cluster to the mean of that
cluster:

$$D_{kj} = \lVert x_{kj} - \bar{z}_j \rVert.$$

The average intracluster distance for the jth cluster is

$$\bar{D}_j = \frac{1}{N_j} \sum_{k=1}^{N_j} D_{kj}.$$

^a From Tou and Gonzales (1974, p. 77).




APPENDIX  B 

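Least Squares Lattice Equations^a

Initialization ($i = 0, 1, \ldots, p$):

$$r_i(-1) = 0, \qquad i \neq p \tag{4a}$$
$$E_i^{r}(-1) = \epsilon_{\mathrm{CLSL}}, \qquad \epsilon_{\mathrm{CLSL}} = 0.001 \text{ and } i \neq p \tag{4b}$$
$$\Delta_i(-1) = 0, \qquad i \neq 0 \tag{4c}$$
$$b_k^{(i)}(-1) = 0, \qquad 0 \le k \le i-1,\ i \neq 0, \text{ and } i \neq p \tag{4d}$$

Time update ($n \ge 0$):

$$e_0(n) = r_0(n) = x(n) \tag{4e}$$
$$E_0^{e}(n) = E_0^{r}(n) = (1-\alpha_{\mathrm{CLSL}})\,E_0^{e}(n-1) + \lvert x(n) \rvert^{2} \tag{4f}$$
$$\gamma_{-1}(n-1) = 0 \tag{4g}$$

Order update ($i = 1, 2, \ldots, p$):

$$\Delta_i(n) = (1-\alpha_{\mathrm{CLSL}})\,\Delta_i(n-1) - \frac{e_{i-1}(n)\, r_{i-1}^{*}(n-1)}{1 - \gamma_{i-2}(n-1)} \tag{4h}$$
$$K_i^{e}(n) = \Delta_i^{*}(n) / E_{i-1}^{e}(n) \tag{4i}$$
$$K_i^{r}(n) = \Delta_i(n) / E_{i-1}^{r}(n-1) \tag{4j}$$
$$e_i(n) = e_{i-1}(n) + K_i^{r}(n)\, r_{i-1}(n-1) \tag{4k}$$
$$r_i(n) = r_{i-1}(n-1) + K_i^{e}(n)\, e_{i-1}(n) \tag{4l}$$
$$E_i^{e}(n) = E_{i-1}^{e}(n) - \lvert \Delta_i(n) \rvert^{2} / E_{i-1}^{r}(n-1) \tag{4m}$$
$$E_i^{r}(n) = E_{i-1}^{r}(n-1) - \lvert \Delta_i(n) \rvert^{2} / E_{i-1}^{e}(n) \tag{4n}$$
$$\gamma_{i-1}(n-1) = \gamma_{i-2}(n-1) + \lvert r_{i-1}(n-1) \rvert^{2} / E_{i-1}^{r}(n-1) \tag{4o}$$
$$a_i^{(i)}(n) = K_i^{r}(n) \tag{4p}$$
$$b_0^{(i)}(n) = K_i^{e}(n) \tag{4q}$$
$$a_k^{(i)}(n) = a_k^{(i-1)}(n) + K_i^{r}(n)\, b_{k-1}^{(i-1)}(n-1) \tag{4r}$$
$$b_k^{(i)}(n) = b_{k-1}^{(i-1)}(n-1) + K_i^{e}(n)\, a_k^{(i-1)}(n) \tag{4s}$$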
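
As a rough illustration of how the error, power, and PARCOR recursions of this appendix fit together, the following sketch implements an exponentially weighted least squares lattice in its standard a posteriori form. It is written under assumptions rather than taken from the cited source: the partial-correlation update is divided by one minus the gain, the coefficient labelled Kr is the one multiplying the delayed backward error in the forward error update, the default alpha of zero (a growing window) and the AR(2) test signal are hypothetical, and the predictor coefficient updates are omitted.

```python
import numpy as np

def lsl_parcor(x, order, alpha=0.0, eps=1e-3):
    """Return the PARCOR pairs (Ke, Kr) of each stage at the final sample."""
    lam = 1.0 - alpha                            # fade factor
    r_prev = np.zeros(order, dtype=complex)      # r_i(n-1), i = 0..p-1
    Er_prev = np.full(order, eps)                # E^r_i(n-1), i = 0..p-1
    Ee0 = eps                                    # E^e_0(n-1)
    delta = np.zeros(order + 1, dtype=complex)   # Delta_i, i = 1..p
    Ke = np.zeros(order + 1, dtype=complex)
    Kr = np.zeros(order + 1, dtype=complex)
    for xn in np.asarray(x, dtype=complex):
        r_new = np.zeros(order, dtype=complex)
        Er_new = np.zeros(order)
        e = xn                                   # e_0(n)
        Ee = lam * Ee0 + abs(xn) ** 2            # E^e_0(n) = E^r_0(n)
        Ee0 = Ee
        r_cur, Er_cur = xn, Ee                   # r_0(n), E^r_0(n)
        gamma = 0.0                              # gamma_{-1}(n-1)
        for i in range(1, order + 1):
            r_new[i - 1], Er_new[i - 1] = r_cur, Er_cur   # save stage i-1 for time n+1
            rb, Erb = r_prev[i - 1], Er_prev[i - 1]       # r_{i-1}(n-1), E^r_{i-1}(n-1)
            delta[i] = lam * delta[i] - e * np.conj(rb) / (1.0 - gamma)
            Ke[i] = np.conj(delta[i]) / Ee       # normalized by forward power
            Kr[i] = delta[i] / Erb               # normalized by backward power
            d2 = abs(delta[i]) ** 2
            e_next = e + Kr[i] * rb              # forward error of stage i
            r_cur = rb + Ke[i] * e               # backward error of stage i
            Ee_next = Ee - d2 / Erb              # forward power of stage i
            Er_cur = Erb - d2 / Ee               # backward power of stage i
            gamma = gamma + abs(rb) ** 2 / Erb   # gain for the next stage
            e, Ee = e_next, Ee_next
        r_prev, Er_prev = r_new, Er_new
    return Ke[1:], Kr[1:]

# Hypothetical real AR(2) signal: x(n) = 0.9 x(n-1) - 0.5 x(n-2) + w(n)
rng = np.random.default_rng(0)
w = rng.standard_normal(4000)
x = np.zeros(4000)
for n in range(4000):
    x[n] = w[n]
    if n >= 1:
        x[n] += 0.9 * x[n - 1]
    if n >= 2:
        x[n] -= 0.5 * x[n - 2]

Ke, Kr = lsl_parcor(x, order=2)
print(np.round(Kr.real, 2))   # the model's own PARCOR values are -0.6 and 0.5
```

For a stationary real input the two coefficients of each stage should nearly coincide, since the forward and backward powers are then almost equal.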

Least Squares Lattice Variables^b

Lattice Parameter^c                        Symbol

Number of time samples (iterations)
Filter order                               $p$
Time variable                              $n$
Stage variable, current stage              $i$
Stage variable, lower stages               $k$, $0 \le k \le i-1$
Input time sample                          $x(n)$
Gain                                       $\gamma_{i-2}(n-1)$
Fade factor                                $(1-\alpha_{\mathrm{CLSL}})$
Step size parameter                        $\Delta_i(n)$
Forward predictor coefficient vector       $a^{(i)}(n)$
Backward predictor coefficient vector      $b^{(i)}(n)$
Highest (ith) forward predictor            $a_i^{(i)}(n)$
Previous vector of backward predictors     $b^{(i-1)}(n-1)$
Forward power                              $E_i^{e}(n)$
Backward power                             $E_i^{r}(n)$
Forward PARCOR coefficient                 $K_i^{e}(n)$
Backward PARCOR coefficient                $K_i^{r}(n)$
Forward error                              $e_i(n)$
Backward error                             $r_i(n)$

^a From Hodgkiss & Presley (1982, pp. 331-332).
^b Adapted from Hodgkiss & Presley (1982, pp. 331-332).
^c Variables pertain to ith stage unless otherwise noted.